Wide spreading data storage architecture

ABSTRACT

Technology is disclosed for a data storage architecture for providing enhanced storage resiliency for a data object. The data storage architecture can be implemented in a single-tier configuration and/or a multi-tier configuration. In the single-tier configuration, a data object is encoded, e.g., based on an erasure coding method, to generate many data fragments, which are stored across many storage devices. In the multi-tier configuration, a data object is encoded, e.g., based on an erasure coding method, to generate many data segments, which are sent to one or more tiers of storage nodes. Each of the storage nodes further encodes the data segment to generate many data fragments representing the data segment, which are stored across many storage devices associated with the storage node. The I/O operations for rebuilding the data in case of device failures are spread across many storage devices, which minimizes the wear of a given storage device.

TECHNICAL FIELD

Several of the disclosed embodiments relate to data storage, and more particularly, to data storage architecture for enhanced storage resiliency.

BACKGROUND

Commercial enterprises (e.g., companies) and others gather, store, and analyze an increasing amount of data. The trend now is to store and archive almost all data before making a decision on whether or not to analyze the stored data. Although the per unit cost associated with storing data has declined over time, the total cost of storage has increased for many companies because of the volumes of stored data. Hence, it is important for companies to find cost-effective ways to manage their data storage environments for storing and managing large quantities of data. There are several problems with traditional approaches to capacity storage. Most traditional storage systems have difficulty scaling to support billions of values, which is far smaller than the trillions of objects that customers are storing today.

Traditional data protection mechanisms, e.g., RAID, are increasingly ineffective in petabyte-scale systems as a result of: larger drive capacities (without commensurate increases in throughput), larger deployment sizes (mean time between faults is reduced) and lower-quality drives. The trends from the hard drive vendors are making traditional RAID increasingly difficult to implement, and are requiring complex techniques, e.g., triple parity, declustering. Some of the storage device trends that push away from traditional data protection mechanisms include: increasing drive sizes, lower I/O limits on drives, varying latency (which can slow I/O), varying capacity (within a given model/drive line, which can increase inefficiency of traditional RAID), and lower drive reliability (increased failure rates and more intense workload-triggered failures). Thus, the traditional data protection mechanisms are ill-suited for the emerging capacity storage market needs.

Further, the current data storage systems have complex data protection mechanisms, which typically involve performing a significant amount of I/O on the storage devices in order to provide a specified storage resiliency. This intensive I/O for protection purposes, together with the I/O performed for providing data access to the customers, wears the storage device much faster and therefore rapidly decreases the lifespan of the device. In order to maintain the same storage resiliency, the storage devices may have to be replaced with new ones regularly, which can drive up the storage costs.

In an object-based storage system, certain metadata, e.g., object size, creation date, owner, etc., are maintained for each object. In most of the current object storage systems, this metadata is kept in a database separate from the object data. Typically, this database is maintained in one or more different servers, e.g., metadata servers. Ensuring that the objects themselves are consistent with the metadata in the metadata server is a difficult problem. The metadata servers themselves can become a bottleneck in the storage system, since they have to deal with updates every time an object is created, modified, or accessed. Typically, there is more than one metadata server, both to address this bottleneck and to make sure that the metadata is durable (not lost). The more such metadata servers there are, the harder it is to keep them consistent with one another as well as with the objects themselves.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a perspective plan view of a storage shelf and components therein, consistent with various embodiments.

FIG. 1B is a perspective view of a storage rack of storage shelves, consistent with various embodiments.

FIG. 2 is a block diagram of a storage shelf, in accordance with various embodiments.

FIG. 3 is a block diagram illustrating an environment in which a data storage architecture can be implemented, consistent with various embodiments.

FIG. 4 is a block diagram of a storage system implementing wide spreading storage architecture, consistent with various embodiments.

FIG. 5 is a block diagram for storing metadata of a data object with the data object in a storage system of FIG. 4, consistent with various embodiments.

FIG. 6 is a flow diagram of a process of storing data to an object-based storage system using the wide spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 7 is a flow diagram of a process of reading data from an object-based storage system using the wide spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 8 is a flow diagram of a process of rebuilding data fragments of a data object in the wide spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 9 is a flow diagram of a process of storing metadata of a data object with the data object in the wide spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 10 is a flow diagram of a process of processing metadata and data fragments of a data object in the wide spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 11 is a block diagram of a storage system implementing hierarchical spreading storage architecture, consistent with various embodiments.

FIG. 12 is a block diagram for storing metadata of a data object with the data object in a storage system of FIG. 11, consistent with various embodiments.

FIG. 13 is a flow diagram of a process of storing data to an object-based storage system using the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 14 is a flow diagram of a process of reading data from an object-based storage system using the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 15 is a flow diagram of a process of rebuilding data fragments of a data object in the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 16 is a flow diagram of a process of rebuilding data segments of a data object in the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 17 is a flow diagram of a process of deferred rebuilding of data segments of a data object in the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 18 is a flow diagram of a process of processing metadata and data fragments of a data object in the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology.

FIG. 19 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology.

DETAILED DESCRIPTION

Technology is disclosed for a data storage architecture for providing enhanced storage resiliency. Storage resiliency or data durability can be defined as a resistance to loss of one or more storage devices storing a portion of a data object or as a resistance to loss of one or more portions of the data object. The data storage architecture can be implemented in a single-tier configuration (also referred to as “wide spreading storage architecture”) and/or a multi-tier configuration (also referred to as “hierarchical spreading storage architecture”). In either architecture, additional redundant portions of the data object are generated and stored across a number of storage devices, e.g., to provide storage resiliency for the data object. In some embodiments, the number of redundant portions generated depends on a specified storage resiliency. In some embodiments, the redundant portions are generated by encoding the data object based on an erasure coding method. The encoding of the data object generates a number of data object fragments, which include redundant fragments. The encoded data fragments are stored across various storage devices.

In the single-tier configuration of the data storage architecture, a storage system includes a number of storage devices, for example, hundreds or thousands of storage devices. A data object can be split into a number of fragments and stored across the storage devices. In some embodiments, the data object is encoded based on an erasure coding method to generate a number of fragments. The fragments are distributed across the storage devices. In some embodiments, the storage resiliency of the data object depends on a storage layout of the fragments. For example, if most of the fragments are stored on the same storage device or on storage devices in the same storage shelf, the storage resiliency can be lower, as loss of the storage device or the storage shelf can result in a higher probability of data loss. In another example, spreading the fragments widely across a large number of storage devices or storage shelves can provide better storage resiliency.

The number of encoded data fragments generated depends on a specified storage resiliency. In some embodiments, a ratio of the total number of fragments “n” generated to a minimum number of fragments “k” required for reconstructing the object is a function of the specified storage resiliency. For example, if n/k is 130%, then the storage resiliency is 30%. That is, the storage system generates 30% more fragments than the minimum k and can tolerate or resist loss of those extra fragments without losing the data object. If the number of storage devices is more than n, so that each fragment resides on a different storage device, the storage system can tolerate or resist loss of up to n−k storage devices without losing the data. To obtain a storage resiliency of 30%, the storage system generates 30% redundant fragments for the purposes of data protection. For example, if the minimum number of fragments, k, is “1000,” then the total number of fragments generated, n, is “1300”, and the same system above would be able to tolerate “300” storage devices failing before data can be lost. This illustrates the importance to data protection of having a large n. The n data fragments are then spread widely across the storage devices. The storage resiliency can also be represented in the form of an equation, n=k+m, where “k” is the original number of data fragments or the minimum number of data fragments required to regenerate or rebuild the data object, and the variable “m” stands for the extra or redundant fragments that are added to provide protection from failures. The variable “n” is the total number of fragments created after the encoding process. The data object can be reconstructed, e.g., in response to a request from a client system, by obtaining at least k encoded data fragments and decoding those to regenerate the data object.
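The relationship n=k+m can be worked through with a short sketch. The function names and the use of an integer percentage are assumptions made for illustration only; the disclosure does not prescribe how the counts are computed or rounded.

```python
def fragment_counts(k, resiliency_percent):
    """Return (m, n) for a minimum fragment count k.

    resiliency_percent is the specified storage resiliency as an integer
    percentage (30 means n/k = 130%). Rounding m up is an assumption.
    """
    m = (k * resiliency_percent + 99) // 100  # redundant fragments, rounded up
    n = k + m                                 # total fragments after encoding
    return m, n


def can_reconstruct(n, k, lost_fragments):
    """The object is recoverable while at least k of the n fragments survive."""
    return n - lost_fragments >= k


m, n = fragment_counts(1000, 30)
print(m, n)                          # 300 1300
print(can_reconstruct(n, 1000, 300))  # True
print(can_reconstruct(n, 1000, 301))  # False
```

With one fragment per storage device, the second function mirrors the statement that the example system tolerates “300” device failures but not more.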

In some embodiments, such storage resiliency can also be provided to metadata of the data object. The metadata of the data object can be stored with the data object and spread across various storage devices. This eliminates the need to store the metadata of the data objects in a separate repository from that of the data objects.

The single-tier storage architecture provides a number of benefits over existing architectures, e.g., RAID storage architecture. For example, in the single-tier architecture a write and/or read is spread across a large number of storage devices as opposed to a small set of storage devices in RAID. The writes and reads of the data fragments can be performed in parallel across the storage devices. Additionally, the number of reads performed on the storage devices can be further minimized as only a subset of the total number of data fragments is required to be read for regenerating the data object, thereby increasing a lifespan of the storage devices and lowering latency of access.

Further, the number of read-write operations performed on a particular storage device to regenerate the data fragments due to loss of one or more storage devices is minimized as the reads and writes are spread across the storage devices. For example, if a set of data fragments is lost due to failure of a storage device, the set of data fragments can be reconstructed by obtaining at least k data fragments from the remaining storage devices and generating the replacement data fragments as a function of the obtained data fragments. In some embodiments, the k data fragments are obtained from a first set of storage devices and the replacement data fragments are stored on a different set of storage devices, which distributes the read/write operations across different sets of storage devices, thereby minimizing the read-write operations on a particular storage device and increasing the lifespan of the particular storage device.
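The rebuild described above can be sketched as a planning step that picks k surviving fragments to read and a disjoint set of devices to write the replacements to. This is a minimal illustration under assumed data structures (a per-object dictionary from fragment identifier to device identifier); the names `plan_rebuild`, `fragment_locations`, and `all_devices` are hypothetical, and the selection policy (lowest fragment identifiers first) is a placeholder.

```python
def plan_rebuild(fragment_locations, failed_device, all_devices, k):
    """Plan the reads and writes needed after one device fails.

    fragment_locations: dict mapping fragment id -> device id (one object).
    Returns (sources, targets): the k (fragment, device) pairs to read, and a
    dict mapping each lost fragment to the device that will hold its replacement.
    """
    lost = sorted(f for f, d in fragment_locations.items() if d == failed_device)
    surviving = sorted((f, d) for f, d in fragment_locations.items()
                       if d != failed_device)
    if len(surviving) < k:
        raise ValueError("fewer than k fragments survive; object cannot be rebuilt")

    sources = surviving[:k]  # read set: any k surviving fragments suffice

    # Write set: devices that hold no fragment of this object, so reads and
    # writes land on different devices.
    occupied = {d for _, d in surviving}
    candidates = sorted(d for d in all_devices
                        if d != failed_device and d not in occupied)
    if len(candidates) < len(lost):
        raise ValueError("not enough spare devices for the replacement fragments")

    targets = dict(zip(lost, candidates))
    return sources, targets


locations = {0: "d0", 1: "d1", 2: "d2", 3: "d3", 4: "d4"}
devices = ["d%d" % i for i in range(8)]
sources, targets = plan_rebuild(locations, "d4", devices, k=3)
print(sources)  # [(0, 'd0'), (1, 'd1'), (2, 'd2')]
print(targets)  # {4: 'd5'}
```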

Additionally, in the single-tier storage architecture, the mean-time-to-repair requirement, i.e., how quickly a failed drive has to be repaired and the data stored in the failed drive reconstructed in order to provide a certain storage resiliency, is less stringent than that of current storage systems, e.g., RAID. Continuing with the above example of 30% storage resiliency with m equal to “300”, the storage system can withstand loss of up to “300” drives. So the repair process can defer operation until a high percentage of those drives have failed. Similarly, the mean time between failure, which is a statistical measure of the time until a failure occurs, in the single-tier storage architecture is higher than that of current storage systems, e.g., RAID. For example, as described above, since the storage system distributes the read/write operations across different sets of storage devices, the read-write operations on a particular storage device are minimized, which increases the lifespan of the particular storage device.

In the multi-tier configuration of the data storage architecture, the storage system includes a number of storage computer nodes which are each associated with a set of storage devices. The storage system encodes a data object into a number of data segments and distributes them to a number of storage computer nodes. Each of the storage computer nodes further encodes the data segment into a number of fragments and stores the fragments across storage devices associated with the storage computer node. For example, the storage system can encode the data object into “16” segments and send each of the “16” segments to different storage computer nodes. Each of the storage computer nodes can encode, independent of the other storage computer nodes, the segment into “16” fragments and store them across a set of storage devices associated with the storage computer node. The storage system can distribute the segments to a selected set of storage computer nodes and store the fragments at a selected set of storage devices based on a storage layout of the data object. The storage layout can be specified by a user, e.g., an administrator of the storage system, or calculated automatically based on operational characteristics of the storage system, e.g., capacity, load, wear, age and health.

The storage resiliency in the multi-tier configuration of the data storage architecture is distributed between the tiers. For example, if the storage resiliency in a two-level storage architecture is 30%, then the first tier of storage computer nodes could offer 15% storage resiliency, with the second tier of storage devices offering 15% storage resiliency. In some embodiments, this can mean that the storage system can generate 15% extra segments and 15% extra fragments for protection purposes.
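As a rough illustration of how a per-tier resiliency translates into counts of segments and fragments, the sketch below applies the same n=k+m arithmetic independently at each tier. The function names and the choice of 100 minimum segments and 100 minimum fragments are hypothetical values picked so that 15% yields whole numbers; they do not come from the disclosure.

```python
def encoded_count(k, resiliency_percent):
    """Total pieces produced when k is the minimum and m extras are added."""
    m = (k * resiliency_percent + 99) // 100  # extras, rounded up (assumption)
    return k + m


def two_tier_plan(k_segments, k_fragments, tier1_percent, tier2_percent):
    """Count segments (first tier) and fragments per segment (second tier)."""
    n_segments = encoded_count(k_segments, tier1_percent)
    n_fragments = encoded_count(k_fragments, tier2_percent)
    return {
        "segments": n_segments,
        "fragments_per_segment": n_fragments,
        "total_fragments": n_segments * n_fragments,
        "tolerated_segment_losses": n_segments - k_segments,
        "tolerated_fragment_losses_per_segment": n_fragments - k_fragments,
    }


plan = two_tier_plan(100, 100, 15, 15)
print(plan["segments"], plan["fragments_per_segment"])  # 115 115
print(plan["total_fragments"])                          # 13225
```

In this sketch each tier tolerates the loss of its own 15 extra pieces: up to 15 segments (for example, whole storage computer nodes) and, within each surviving segment, up to 15 fragments.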

In some embodiments, such storage resiliency can also be provided to metadata of the data object. The metadata of the data object can be stored with the data object and spread across various storage devices, which eliminates the need to store the metadata of the data objects in a separate repository from that of the data objects. For example, the metadata can be prefixed to the segments and/or fragments and stored across various storage devices.

One of the advantages of the multi-tier storage architecture is a localized data regeneration process. For example, if a storage device of a particular storage computer node fails, a fragment of a particular segment stored on the failed storage device can be regenerated using other fragments of the segment stored at other storage devices of the storage computer node. The storage system may not have to obtain fragments from other storage computer nodes. After the replacement fragment is generated, it can be stored at one of the remaining storage devices of the storage computer node. The reads and writes are restricted to the storage devices of a particular storage computer node. By restricting the reads and writes to the local storage devices of a storage computer node, the data traffic in the network, e.g., between storage computer nodes, is minimized, as is the amount of data that must be read from storage devices.

The storage system can store the data object across two or more tiers. For example, the storage system can have two tiers of storage computer nodes, where a first tier storage computer node can be associated with a number of second tier storage computer nodes and each of the second tier storage computer nodes can be associated with a set of storage devices. The data object is split into a number of segments and the segments are sent to first tier storage computer nodes, where each first tier storage computer node splits the corresponding data segment into a number of fragments and distributes the fragments to a number of second tier storage computer nodes. Each of the second tier storage computer nodes splits the data fragment into a number of sub-fragments and stores the sub-fragments across a set of storage devices associated with the second tier storage computer node.

The storage devices of the storage system can be organized as storage shelves and storage racks, where each storage rack includes a number of storage shelves and each storage shelf includes a number of storage devices. The storage racks/shelves/devices can be distributed across various geographical locations.

Environment

FIG. 1A is a perspective plan view of a storage shelf 100 and components therein, consistent with various embodiments. The storage shelf 100 includes an enclosure shell 102 (partially shown) that encloses and protects multiple data storage devices 104. The data storage devices 104 may be hard drives, solid-state drives, flash drives, tape drives, or any combination thereof. It is noted that the term “enclose” does not necessarily require sealing the enclosure and does not necessarily require enveloping all sides of the enclosure.

The storage shelf 100 further includes control circuitry 106 that manages the power supply of the storage shelf 100, the data access to and from the data storage devices 104, and other storage operations to the data storage devices 104. The control circuitry 106 may implement each of its functions as a single component or a combination of separate components.

As shown, the storage shelf 100 is adapted as a rectangular prism that sits on an elongated surface 108 of the rectangular prism. Each of the data storage devices 104 may be stacked within the storage shelf 100. For example, the data storage devices 104 can stack on top of one another into columns. The control circuitry 106 can stack on top of one or more of the data storage devices 104 and one or more of the data storage devices 104 can also stack on top of the control circuitry 106.

In various embodiments, the enclosure shell 102 encloses the data storage devices 104 without providing window openings to access individual data storage devices or individual columns of data storage devices. In these embodiments, each of the storage shelves 100 is disposable such that after a specified number of the data storage devices 104 fail, the entire storage shelf 100 can be replaced as a whole instead of replacing individual failed data storage devices. Alternatively, the storage shelf 100 may be replaced after a specified time, e.g., corresponding to an expected lifetime.

The illustrated stacking of the data storage devices 104 in the storage shelf 100 enables a higher density of standard disk drives (e.g., 3.5 inch disk drives) in a standard shelf (e.g., a 19 inch width rack shelf). Each storage shelf 100 can store ten of the standard disk drives. In the cases that the data storage devices 104 are disk drives, the storage shelf 100 can hold the disk drives “flat” such that the spinning disks are parallel to the gravitational field.

The storage shelf 100 may include a handle 110 on one end of the enclosure shell 102 and a data connection port 112 (not shown) on the other end. The handle 110 is attached on an outer surface of the enclosure shell 102 to facilitate carrying of the storage shelf 100. The enclosure shell 102 exposes the handle 110 on its front surface. For example, the handle 110 may be a retractable handle that retracts to fit next to the front surface when not in use.

FIG. 1B is a perspective view of a storage rack 150 of storage shelves, consistent with various embodiments. The storage shelves may be instances of the storage shelf 100 illustrated in FIG. 1A. The storage rack 150, as illustrated, includes a tray structure 152 (e.g., a rack shelf) securing four instances of the storage shelf 100. The tray structure 152 can be a standard 2U 19″ deep rack mount. The storage rack 150 may include a stack of tray structures 152, each securely attached to a set of rails 162. Management devices 164 may be placed at the top shelves of the rack 150. For example, the management devices 164 may include network switches, power regulators, front-end storage appliances, or any combination thereof.

FIG. 2 is a block diagram of a storage shelf 200, in accordance with various embodiments. In some embodiments, the storage shelf 200 is the storage shelf 100 of FIG. 1A. The storage shelf 200 includes a processor 202, an operational memory 206, a boot flash 208, a data communication port 210, a power management module 212, storage interfaces 214, and data storage devices 216.

The processor 202 can be a microprocessor, a controller, an application specific integrated circuit, a field programmable gate array, or any combination thereof. The boot flash 208 is a memory device storing an operating system 218. The processor 202 can load the operating system 218 into the operational memory 206 and run the operating system 218. A data access application programming interface (API) service 220 can execute on this operating system to provide data access over a network to the data storage devices 216 for clients (e.g., devices, applications, or systems).

The data communication port 210 enables the storage shelf 200 to connect with the network. For example, the data communication port 210 can be a Power-over-Ethernet module that connects to an Ethernet cable to both establish a network connection with the network and power the storage shelf 200.

In various embodiments, the storage shelf 200 turns on only a subset (hereinafter the “active set”) of the data storage devices 216 at a time. The active set can be a single data storage device or more than one data storage device. The data access API service 220 can determine the membership of the active set depending on client requests received through the network. A client can either specifically request access to a data storage device or request a data range, in which case the data access API service 220 determines which data storage device stores the data range.
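One way to picture the active-set decision is the sketch below, which resolves a client request either directly to a named device or, for a data range, to every device whose stored range overlaps it. The request shape, the `device_ranges` table, and the half-open range convention are all assumptions for illustration and are not taken from the data access API service 220.

```python
def select_active_set(request, device_ranges):
    """Return the set of device ids that must be powered for this request.

    request: {"device": id} or {"range": (start, end)}, end exclusive.
    device_ranges: dict mapping device id -> (start, end) of the data it stores.
    """
    if "device" in request:
        # The client named a data storage device explicitly.
        return {request["device"]}

    start, end = request["range"]
    # Two half-open ranges overlap when each one starts before the other ends.
    return {dev for dev, (lo, hi) in device_ranges.items()
            if lo < end and start < hi}


ranges = {"d0": (0, 100), "d1": (100, 200), "d2": (200, 300)}
print(select_active_set({"device": "d0"}, ranges))            # {'d0'}
print(sorted(select_active_set({"range": (150, 250)}, ranges)))  # ['d1', 'd2']
```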

The power management module 212 provides electronic circuitry to switch on and off components of the storage shelf 200, e.g., to activate only one subset of the data storage devices at a time. The power management module 212 can receive instructions from the processor 202 (e.g., as part of the data access API service 220) to provide power to the designated active set, including a subset of the storage interfaces 214 that enables data access to the active set. Once power is supplied to the designated active set, a storage controller 222 can facilitate communication between the processor 202 and the data storage devices 216 through the storage interfaces 214.

FIG. 3 is a block diagram illustrating an environment in which the data storage architecture can be implemented, consistent with various embodiments. The environment 300 includes a number of storage devices, e.g., storage device 304, which are organized as a number of storage shelves 306 a-n (collectively referred to as “storage subsystem 306”). In some embodiments, each of the storage shelves in the storage subsystem 306 can be similar to the storage shelf 100 of FIG. 1A and each of the storage devices, including the storage device 304, can be similar to the data storage devices 104 or the data storage devices 216 of FIG. 2. Further, the storage shelves 306 a-n can be part of one or more storage racks, e.g., storage rack 150. The storage subsystem 306 can be spread across various geographical locations.

The environment 300 includes a front-end subsystem 310 that facilitates storing and/or retrieving data from the storage subsystem 306. The front-end subsystem 310 processes the read/write requests from clients 312 a-c (collectively referred to as “clients 312”). In some embodiments, the storage subsystem 306 is implemented as an object storage system, which manages data as data objects. The front-end subsystem 310 stores the data received from the clients as data objects in the storage subsystem 306. The front-end subsystem 310 can receive the data from the clients as data objects or in other formats. If the front-end subsystem 310 receives the data in other formats, it can convert the data into data objects before storing the data in the storage subsystem 306. In some embodiments, the front-end subsystem 310 also stores the metadata of the data with the data objects.

The environment 300 supports both the single-tier configuration and the multi-tier configuration of the data storage architecture. In the single-tier storage architecture, the front-end subsystem 310 encodes the data object, e.g., received from a client, to generate a number of data fragments and stores the data fragments across one or more of the storage devices of the storage subsystem 306. In some embodiments, the front-end subsystem 310 encodes the data object based on an erasure coding method. In some embodiments, an erasure coding method encodes the data object to generate n fragments. The n fragments include some redundant fragments which are generated for storage resiliency/data protection purposes. The erasure coding requires at least k out of the n fragments to regenerate the data object. In some embodiments, the ratio of n to k indicates a storage resiliency of the data object.
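The k-out-of-n property can be demonstrated with a small, self-contained erasure code. The disclosure does not name a particular erasure coding method, so the sketch below substitutes a simple polynomial-interpolation code over the prime field of size 257 (in the spirit of Reed-Solomon codes) chosen purely for brevity; it is not the method of the disclosed embodiments and is far too slow for production use. All function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
P = 257  # prime just above the byte range; fragment symbols are ints in [0, 256]


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k, n):
    """Return {fragment index: list of symbols}; any k fragments recover data.

    Fragments 1..k carry the data itself; fragments k+1..n are redundant.
    Requires n < P so that all fragment indices are distinct modulo P.
    """
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    fragments = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(padded), k):
        stripe = padded[start:start + k]
        points = list(zip(range(1, k + 1), stripe))
        for x in range(1, n + 1):
            fragments[x].append(stripe[x - 1] if x <= k else _interpolate(points, x))
    return fragments


def decode(fragments, k, length):
    """Rebuild the original bytes from any k surviving fragments."""
    chosen = sorted(fragments.items())[:k]
    if len(chosen) < k:
        raise ValueError("need at least k fragments")
    out = bytearray()
    for s in range(len(chosen[0][1])):
        points = [(x, symbols[s]) for x, symbols in chosen]
        out.extend(_interpolate(points, x) for x in range(1, k + 1))
    return bytes(out[:length])


data = b"wide spreading"
frags = encode(data, k=4, n=6)
survivors = {x: f for x, f in frags.items() if x not in (1, 4)}  # lose 2 of 6
print(decode(survivors, k=4, length=len(data)) == data)  # True
```

Here n/k is 150%, so the object survives the loss of any two of the six fragments, and losing a third would leave fewer than k.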

In the multi-tier storage configuration, the environment 300 includes one or more tiers of hierarchical storage nodes, e.g., hierarchical storage nodes 314-318. Each of the hierarchical storage nodes 314-318 can be associated with a set of storage devices. For example, the hierarchical storage node 314 is associated with storage devices from storage shelves 306 a and 306 b, the hierarchical storage node 316 is associated with storage devices from storage shelf 306 c, and the hierarchical storage node 318 is associated with storage devices from storage shelves 306 d and 306 e.

In the multi-tier storage configuration, the front-end subsystem 310 encodes the data object, e.g., based on erasure coding, to generate a number of data segments and distributes them to a number of hierarchical storage nodes, e.g., hierarchical storage nodes 314-318. Each of the hierarchical storage nodes 314-318 further splits the data segment into a number of fragments and stores the fragments across storage devices associated with the hierarchical storage node. For example, the front-end subsystem 310 can split the data object into “3” segments and send each of the “3” segments to a different one of the hierarchical storage nodes 314-318. Each of the hierarchical storage nodes 314-318, e.g., hierarchical storage node 314, can split, independent of the other hierarchical storage nodes, the segment into “16” fragments and store them across a set of associated storage devices, e.g., storage devices from storage shelves 306 a and 306 b. The segments and fragments are distributed to a selected set of hierarchical storage nodes and storage devices, respectively, based on a storage layout of the data object. The storage layout can be specified by a user, e.g., an administrator of the storage system, or calculated automatically based on operational characteristics of the storage system, such as capacity, load, wear, age and health.
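An automatically calculated storage layout could, for instance, rank devices by operational characteristics and spread the chosen devices across shelves. The sketch below is one hypothetical policy, not the disclosed one: the device record fields (`shelf`, `free_gb`, `wear`), the ranking order, and the round-robin across shelves are all assumptions.

```python
from collections import defaultdict
from itertools import zip_longest


def choose_layout(devices, n):
    """Pick n device ids for n fragments, one fragment per device.

    devices: list of dicts with 'id', 'shelf', 'free_gb', and 'wear' (0.0-1.0).
    Devices are ranked within each shelf (least worn, then most free space),
    then taken round-robin across shelves so fragments spread widely.
    """
    by_shelf = defaultdict(list)
    for dev in devices:
        by_shelf[dev["shelf"]].append(dev)
    for shelf_devices in by_shelf.values():
        shelf_devices.sort(key=lambda d: (d["wear"], -d["free_gb"], d["id"]))

    chosen = []
    # Each pass takes the next-best device from every shelf in turn.
    for rank in zip_longest(*(by_shelf[s] for s in sorted(by_shelf))):
        for dev in rank:
            if dev is not None and len(chosen) < n:
                chosen.append(dev["id"])
    if len(chosen) < n:
        raise ValueError("not enough devices for the requested layout")
    return chosen


devices = [
    {"id": "a1", "shelf": "a", "free_gb": 500, "wear": 0.10},
    {"id": "a2", "shelf": "a", "free_gb": 900, "wear": 0.50},
    {"id": "b1", "shelf": "b", "free_gb": 700, "wear": 0.20},
    {"id": "b2", "shelf": "b", "free_gb": 300, "wear": 0.05},
    {"id": "c1", "shelf": "c", "free_gb": 800, "wear": 0.30},
]
print(choose_layout(devices, 4))  # ['a1', 'b2', 'c1', 'a2']
```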

When a client system, e.g., client 312 a, requests to access the data object, the front-end subsystem 310 determines the storage layout of the data segments, requests the identified hierarchical storage nodes, e.g., one or more of the hierarchical storage nodes 314-318, to obtain the fragments of a segment from the storage devices and decode them to generate the segment, and decodes the segments to generate the data object. The front-end subsystem 310 returns the data object to the client 312 a. In some embodiments, the front-end subsystem 310 obtains at least the minimum number of segments required to regenerate the data object and the hierarchical storage nodes obtain at least the minimum number of fragments required to regenerate the data segment.

In some embodiments, both the single-tier configuration and multi-tier configuration of the data storage architecture can be implemented in the same storage system as illustrated in the environment 300. Further, in some embodiments, one of the two configurations is automatically and/or dynamically chosen for performing the read/write operations. A particular configuration can be selected based on a number of factors, e.g., type of data to be written, a client from whom the data is received, included metadata, etc. In some embodiments, the front-end subsystem 310 is configured to select the particular configuration based on the above factors.

FIG. 4 is a block diagram of a storage system 400 implementing wide spreading storage architecture, consistent with various embodiments. In some embodiments, the storage system 400 can be implemented in the environment 300 of FIG. 3. The storage system 400 includes the front-end subsystem 310 that facilitates data storage and retrieval from the storage subsystem 306. The front-end subsystem 310 can be one or more computer systems (e.g., the computer system of FIG. 19), having either a shared-nothing architecture or a shared database architecture, connected to the storage subsystem 306 over a network (e.g., a global network or a local network). The front-end subsystem 310 can be on a separate rack from the storage subsystem 306, or can be combined with the hierarchical storage node 314 or a storage shelf of the storage subsystem 306.

The front-end subsystem 310 includes a protocol interfaces module 406. The protocol interfaces module 406 defines one or more functional interfaces that applications and devices use to store, retrieve, update, and delete data elements from the storage system 400. For example, the protocol interfaces module 406 can implement a Cloud Data Management Interface (CDMI), a Simple Storage Service (S3) interface, or both. The front-end subsystem 310 includes a staging area 408. The staging area 408 is a memory space implemented by one or more data storage devices within or accessible to the front-end subsystem 310. For example, the staging area 408 can be implemented by solid-state drives, hard disks, volatile memory, or any combination thereof. The staging area 408 can maintain an object namespace 410 to facilitate client interactions through the protocol interfaces module 406. The object namespace 410 manages a set of data container identifiers, e.g., object identifiers of data received from clients of the front-end subsystem 310. The staging area 408 also maintains a fragment namespace 412 corresponding to the object namespace 410. The fragment namespace 412 manages a set of fragment identifiers, each corresponding to a data fragment stored in the storage subsystem 306. The staging area 408 can store a mapping structure 414 that stores associations between the data container identifiers of the object namespace 410 and the fragment identifiers of the fragment namespace 412.

In some embodiments, the front-end subsystem 310 can be implemented as a distributed computing network including multiple computing nodes (e.g., computer servers). Each computing node can include an instance of the staging area 408. The namespaces (e.g., the object namespace 410 and the fragment namespace 412) of each staging area 408 can be implemented either as a shared-nothing database or as a shared database.

The staging area 408 can also serve as a temporary cache to process payload data from a write request received at the protocol interfaces module 406. The request module 416 receives read/write requests from the clients of the storage system 400. The front-end subsystem 310 processes an incoming write request by performing a number of storage efficiency processes on the payload data of the write request prior to sending the payload data into persistent storage in the storage subsystem 306. In some embodiments, the storage efficiency processes include deduplication, compression, fragmentation, erasure coding and fragment encryption of the payload data.

The storage processing module 430 performs the deduplication process on the payload data, which removes duplicate data portions from the payload data. The storage processing module 430 can use a number of deduplication techniques for deduplicating the payload data. The storage processing module 430 can compress the payload data, e.g., to reduce the storage space occupied by the payload data. The storage processing module can implement one or more compression algorithms for compressing the payload data.
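The deduplication and compression steps can be illustrated with a minimal sketch. It assumes fixed-size chunking, SHA-256 chunk digests, and zlib compression; the disclosure does not name any of these, and the function names and the chunk size below are hypothetical.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # hypothetical chunk size, not specified in the text


def deduplicate(payload, chunk_size=CHUNK_SIZE):
    """Split the payload into fixed-size chunks and keep each distinct chunk once.

    Returns (recipe, store): recipe is the ordered list of chunk digests needed
    to rebuild the payload, store maps each digest to its chunk bytes.
    """
    recipe, store = [], {}
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # duplicate chunks are stored only once
        recipe.append(digest)
    return recipe, store


def compress_store(store):
    """Compress every distinct chunk to reduce the space it occupies."""
    return {digest: zlib.compress(chunk) for digest, chunk in store.items()}


def restore(recipe, compressed_store):
    """Reverse both steps: decompress the chunks and reassemble them in order."""
    return b"".join(zlib.decompress(compressed_store[d]) for d in recipe)
```

In this sketch the recipe plays the role of the information needed to reverse the deduplication on the read path; a real system could use variable-size chunking or a different digest.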

The encode/decode module 418 fragments the payload data into a number of fragments, which include redundant fragments for the purpose of data protection. In some embodiments, the encode/decode module 418 performs the encoding based on one or more erasure coding techniques. Erasure coding is a method of data protection in which payload data is broken into fragments, expanded, and encoded with redundant data fragments. For example, payload data can be broken into k fragments and erasure coded to generate n fragments, where n>k, such that the payload data can be recovered from a subset of the n fragments, e.g., from at least k fragments.
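The "any k of n" property can be illustrated with a toy erasure code. The sketch below is not the encoding used by the encode/decode module 418; it is a simplified polynomial code over the prime field of size 257, evaluated with Lagrange interpolation, chosen only because it fits in a few lines. The names encode, decode and the field size P are assumptions for illustration, and fragment symbols are kept as Python integers rather than packed bytes.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # modular inverse, Python 3.8+
    return total


def encode(data, k, n):
    """Return n fragments; fragment x holds the polynomial values at point x.

    Fragments 0..k-1 carry the original bytes (systematic part), fragments
    k..n-1 carry redundant symbols.
    """
    if not 0 < k <= n < P:
        raise ValueError("need 0 < k <= n < 257")
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    fragments = [[] for _ in range(n)]
    for offset in range(0, len(padded), k):
        stripe = list(enumerate(padded[offset:offset + k]))
        for x in range(n):
            symbol = stripe[x][1] if x < k else _lagrange_eval(stripe, x)
            fragments[x].append(symbol)
    return fragments


def decode(available, k, length):
    """Rebuild the data from any k fragments; available maps fragment index -> symbols."""
    if len(available) < k:
        raise ValueError("fewer than k fragments available")
    chosen = sorted(available.items())[:k]
    out = bytearray()
    for s in range(len(chosen[0][1])):
        points = [(x, symbols[s]) for x, symbols in chosen]
        out.extend(_lagrange_eval(points, x) for x in range(k))
    return bytes(out[:length])
```

For example, with k=4 and n=7, any three of the seven fragments can be discarded and decode still returns the original bytes from the remaining four.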

The storage processing module 430 can further encrypt the data fragments using one or more encryption techniques to generate encrypted data fragments. In some embodiments, the storage processing module 430 encrypts the fragments for data security purposes.

Note that the order of execution of storage efficiency processes is not restricted to the order described above. Alternative embodiments may perform these storage efficiency processes in a different order, and some processes may be removed, moved, added, subdivided, combined, and/or modified to provide alternatives or subcombinations.

The storage layout module 420 determines the storage layout of the data fragments. The storage layout identifies one or more of the storage racks, the storage shelves of a rack, and the storage devices of a storage shelf in which the data fragments are to be stored. In some embodiments, the storage layout module 420 determines the optimal layout of fragments to meet the service level objective (SLO) promised to the client and/or to maximize storage resiliency, and sends the fragments to the selected storage devices of the storage subsystem 306 for storage. In some embodiments, a best storage layout stores each of the data fragments in a different storage device of the storage subsystem 306 to provide the best storage resiliency. In some embodiments, a worst storage layout stores all of the data fragments in the same storage device of the storage subsystem 306. Typically, the storage layout module 420 is configured to distribute the fragments across the storage devices as widely as possible, that is, to store distinct fragments on distinct storage devices.

In some embodiments, the storage layout module 420 selects the storage devices on a random basis. In some embodiments, the storage layout module 420 selects the storage devices on a random weighted basis. The storage layout module 420 can weigh the storage devices based on a number of factors, e.g., available storage capacity, a write latency of the storage device, a read latency of the storage device, and a type of the storage device. For example, the storage layout module 420 can randomly select the storage devices from a set of storage devices that have at least some specified percentage of storage capacity free. In some embodiments, the random weighted basis attempts to store the data fragments evenly across the available storage devices. For example, one type of weighting is to decrease the weight of a storage device if a specified number of fragments are already stored on it. In some embodiments, the random weighted basis randomly identifies the storage devices at which the encoded data fragments are to be stored as a function of decreasing the risk of data loss. For example, if a particular geographical region is prone to a higher number of device failures, then the storage devices in that geographical region may be weighted less so that fewer fragments are written to the storage devices in that geographical region.
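The random weighted selection can be sketched as weighted sampling without replacement, so that distinct fragments land on distinct storage devices. The Device fields, the weight formula, and the thresholds below are hypothetical; the text lists the factors but does not say how they are combined.

```python
import random
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    free_fraction: float        # share of capacity still free, 0.0 to 1.0
    fragments_stored: int       # fragments already placed on this device
    region_failure_rate: float  # assumed relative failure rate of its region, 0.0 to 1.0


def weight(dev, min_free=0.10, crowd_limit=1000):
    """Hypothetical weight: zero below the free-capacity floor, lower when crowded or risky."""
    if dev.free_fraction < min_free:
        return 0.0
    w = dev.free_fraction
    if dev.fragments_stored >= crowd_limit:
        w *= 0.5  # already holds many fragments, so make it less likely
    return w * (1.0 - dev.region_failure_rate)


def choose_devices(devices, n, rng):
    """Pick n distinct devices, each draw proportional to the remaining weights."""
    candidates = [(d, weight(d)) for d in devices]
    candidates = [(d, w) for d, w in candidates if w > 0]
    if len(candidates) < n:
        raise ValueError("not enough eligible devices for n distinct fragments")
    chosen = []
    for _ in range(n):
        weights = [w for _, w in candidates]
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(idx)[0])  # remove so it cannot be picked twice
    return chosen
```

Passing a seeded random.Random instance as rng makes a layout reproducible for testing; a production selector would more likely draw from a system source of randomness.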

In some embodiments, the storage layout module 420 can select the storage devices based on parameters defined by a user, e.g., metadata, a client of the storage system 400, and/or an administrator of the storage system 400.

The following paragraphs describe additional details of writing data to the storage subsystem 306 in wide spreading storage architecture.

When a client, e.g., client 312 a, sends a write request to the storage system 400, the request module 416 receives the request and extracts the data object to be written from the request. The storage processing module 430 performs a number of processes on the data object, e.g., as described above. The encode/decode module 418 encodes the data to generate n fragments. The encode/decode module 418 can use an erasure coding method, e.g., Reed-Solomon, FEC code, Fountain code, Raptor code, Tornado code.

In FIG. 4, the encode/decode module 418 splits the data object 405 into n fragments, F₁ to F_(N). The storage layout module 420 determines the storage layout of the fragments and spreads the fragments, F₁ to F_(N) across the storage devices of the storage subsystem 306. For example, the storage layout module 420 determines that the fragments, F₁ to F₉₉ have to be sent to the storage devices of “storage shelf 1,” fragments, F₁₀₀ to F₁₉₉ to the storage devices of “storage shelf 2,” and fragments, F₂₀₀ to F_(N) to the storage devices of “storage shelf N.” In some embodiments, the storage layout module 420 also determines the storage devices of the storage shelves where the fragments have to be stored. After the storage layout module 420 determines the storage layout, the transceiver module 432 transmits the data fragments to the corresponding storage shelves, which store the data fragments at the storage devices. In some embodiments, the fragments can be written to the different storage devices in parallel.

The number of fragments generated by the encode/decode module 418 depends on the required storage resiliency. The storage resiliency offered can be represented as n=k+m, where variable “k” is the original amount of data fragments or the minimum number of data fragments required to regenerate or rebuild the data object, and variable “m” stands for the extra or redundant fragments that are added to provide protection from failures. The variable “n” is the total number of fragments created after the encoding process.

Typically, in the wide spreading data storage architecture, the width to which the data object is split is wider, and the degree to which the data fragments are spread across the storage devices is wider, e.g., compared to current storage architectures such as RAID. For example, the number of fragments into which the data object is split can be in the hundreds, and the number of storage devices across which those fragments are spread can be in the thousands to tens of thousands.

In some embodiments, a ratio of “n” to “k” indicates the storage resiliency provided for the data object. For example, if n/k is 130%, then the storage resiliency is 30%. That is, the storage system can tolerate or resist loss of a number of data fragments equal to 30% of k without losing the data object. If the number of storage devices is more than n, so that each fragment can be stored on a distinct storage device, the storage system can tolerate or resist loss of up to m (that is, n−k) storage devices without losing the data. For example, if the minimum number of fragments, k, is “1000,” then “m” is “300” and the total number of fragments generated, n, is “1300,” and the storage system is able to tolerate “300” storage devices failing before data can be lost. This illustrates the importance to data protection of having a large n. To obtain a storage resiliency of 30%, the storage system generates 30% redundant fragments for the purposes of data protection. The n data fragments are then spread widely, e.g., across “4000” storage devices.
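The relationship between n, k, m and the storage resiliency can be written out as a short calculation. The function names are illustrative only; the formula 100*m/k is the one used later in the description of block 821.

```python
def redundant_fragments(n, k):
    """m = n - k: fragments that can be lost without losing the data object."""
    return n - k


def resiliency_percent(n, k):
    """Storage resiliency as a percentage of k, i.e. 100 * m / k."""
    return 100.0 * (n - k) / k


def total_fragments(k, resiliency_pct):
    """n needed to reach a given resiliency for a given k."""
    return k + round(k * resiliency_pct / 100)


# Values taken from the example in the text: k = 1000, n = 1300.
print(redundant_fragments(1300, 1000))  # 300
print(resiliency_percent(1300, 1000))   # 30.0
print(total_fragments(1000, 30))        # 1300
```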

The object identifier of the data object and the fragment identifiers of the fragments are stored in the staging area 408 at the object namespace 410 and the fragment namespace 412, respectively. Further, a mapping of the object identifier to the fragment identifiers can be stored in the mapping structure 414 of the staging area 408.
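One possible in-memory shape for the object namespace 410, the fragment namespace 412, and the mapping structure 414 is sketched below. The class and method names are hypothetical, and the per-fragment device record is an added assumption standing in for the storage layout kept by the storage layout module 420.

```python
from dataclasses import dataclass, field


@dataclass
class StagingArea:
    object_namespace: set = field(default_factory=set)    # object identifiers
    fragment_namespace: set = field(default_factory=set)  # fragment identifiers
    mapping: dict = field(default_factory=dict)           # object id -> [fragment ids]
    layout: dict = field(default_factory=dict)            # fragment id -> device id (assumed)

    def record_write(self, object_id, placements):
        """Register an object and its fragments; placements maps fragment id -> device id."""
        self.object_namespace.add(object_id)
        self.fragment_namespace.update(placements)
        self.mapping[object_id] = list(placements)
        self.layout.update(placements)

    def fragments_of(self, object_id):
        """Fragment identifiers for an object, as needed when serving a read request."""
        return self.mapping[object_id]

    def objects_on_device(self, device_id):
        """Objects that have at least one fragment on the given device."""
        return {
            obj
            for obj, frags in self.mapping.items()
            if any(self.layout[f] == device_id for f in frags)
        }
```

The reverse lookup in objects_on_device is the kind of query a rebuild would need after a device failure; a real deployment would likely index it rather than scan the mapping.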

When a read request arrives at the storage system 400 from the client 312 a for the data object, the data object can be reconstructed by obtaining at least k of the n data fragments F₁ to F_(N) and decoding them to regenerate the data object. The transceiver module 432 obtains the storage layout of the fragments from the storage layout module 420 and obtains the data fragments from the identified storage devices of the storage subsystem 306. The storage layout module 420 can use the mapping structure 414 to obtain the fragment identifiers of the data object and then determine the storage devices at which the corresponding fragments are stored.

The transceiver module 432 can obtain from k to n number of fragments. For example, the transceiver module 432 can stop fetching the fragments after obtaining the first k fragments. In another example, the transceiver module 432 can fetch all the n fragments but use only the first k fragments for regenerating the data object.

Further, the transceiver module 432 can preferentially select a subset of the storage devices identified by the storage layout module 420 from which to obtain the fragments. The transceiver module 432 selects a storage device based on a number of factors, e.g., read latency of the storage device, type of the storage device, number of pending read requests ahead of the current read request in a read request queue of the storage device, and the distance to the storage device. Accordingly, the transceiver module 432 may not even read some of the storage devices that contain the data fragments of the data object, thereby minimizing the read operations on those storage devices. In some embodiments, the transceiver module 432 can obtain the fragments from different storage devices in parallel.
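Preferential selection on the read path can be sketched as ranking the fragment locations by an estimated read cost and keeping only the k cheapest. The cost formula and the field names are assumptions; the text names the factors but not how they are weighed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentLocation:
    fragment_id: str
    device_id: str
    read_latency_ms: float
    queue_depth: int     # pending read requests ahead of this one
    distance_km: float   # assumed stand-in for how far away the device is


def read_cost(loc):
    """Hypothetical cost: latency scaled by queue depth, plus a small distance penalty."""
    return loc.read_latency_ms * (1 + loc.queue_depth) + 0.01 * loc.distance_km


def pick_reads(locations, k):
    """Return the k cheapest locations; the remaining devices are not read at all."""
    if len(locations) < k:
        raise ValueError("fewer than k fragments are reachable")
    return sorted(locations, key=read_cost)[:k]
```

Because only k of the n locations are returned, the devices holding the other fragments see no read traffic for this request, which is the effect described above.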

After obtaining the data fragments, the encode/decode module 418 decodes the data fragments, e.g., based on the erasure coding used to encode the data object, to generate the data object. In some embodiments, the storage processing module 430 may perform additional processes on the decoded object before returning the data object to the client 312 a. For example, the storage processing module 430 can perform decompression and de-deduplication on the decoded data object if the data object was deduplicated and compressed.

The wide spreading storage architecture provides robust storage resiliency to the data stored in the storage subsystem 306. The wide spreading storage architecture also provides an efficient way to rebuild the data fragments in case of storage device failures. When a storage device fails, the data fragments stored at the storage device may be lost. When a failure detection module 424 detects a failure or impending failure of a storage device, the failure detection module 424 requests the regeneration module 428 to evacuate readable fragments or rebuild unreadable or lost data fragments to compensate for the ones that are no longer reliably stored. The regeneration module 428 facilitates rebuilding of new data fragments of a data object using the remaining data fragments of the data object stored at other storage devices. For example, if a storage device in “storage shelf 2” storing the data fragments F₄-F₁₀ fails, the regeneration module 428 can rebuild up to seven new data fragments and write the new data fragments to any of the remaining set of storage devices. In some embodiments, the regeneration module 428 rebuilds the data fragments using a sufficient number of the remaining data fragments F₁-F₃ and F₁₁-F_(N). The regeneration module 428 can use the encoding method used to generate the initial fragments to generate the new replacement fragments.

The failed storage device can store data fragments of one or more data objects. The fragment/segment identification module 422 can determine the fragments stored on the storage device that failed, e.g., using the storage layout. The regeneration module 428 can rebuild the data fragments of all the data objects whose fragments are lost or of only a subset of those data objects. For example, the regeneration module 428 can rebuild the data fragments of a data object whose current storage resiliency is less than a specified threshold for minimum storage resiliency. The current storage resiliency is determined as a function of the number of fragments remaining out of the “n” fragments and “k.” For example, if the specified threshold for minimum storage resiliency of a data object is 10% and the current storage resiliency is less than 10%, then the data fragments can be rebuilt for the data object. Further, the regeneration module 428 can start rebuilding the data fragments of the data object whose current storage resiliency is less than the specified threshold immediately, e.g., in response to the failure of the storage device. The regeneration module 428 can rebuild the data fragments of other data objects, whose current storage resiliency is at or above the specified threshold, at a later time. In some embodiments, the regeneration module 428 executes the rebuilding process as a background process of the front-end subsystem 310. In some embodiments, a user, e.g., an administrator of the storage system 400, can manually execute the rebuilding process.
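The split between objects rebuilt immediately and objects rebuilt later can be sketched as follows. The input shape, a dictionary from object identifier to (n, k, remaining fragments), and the function names are assumptions made for illustration.

```python
def current_resiliency_percent(remaining, k):
    """Current resiliency from the fragments still readable, as 100 * (remaining - k) / k."""
    return 100.0 * (remaining - k) / k


def plan_rebuilds(objects, min_resiliency_pct):
    """Split affected objects into (urgent, deferred) lists of object identifiers.

    objects maps object id -> (n, k, remaining). Urgent objects are below the
    threshold and are ordered most-at-risk first; deferred objects have lost
    fragments but are still at or above the threshold.
    """
    urgent, deferred = [], []
    for object_id, (n, k, remaining) in objects.items():
        if remaining == n:
            continue  # no fragments lost, nothing to rebuild
        resiliency = current_resiliency_percent(remaining, k)
        if resiliency < min_resiliency_pct:
            urgent.append((resiliency, object_id))
        else:
            deferred.append(object_id)
    return [obj for _, obj in sorted(urgent)], deferred
```

With a 10% threshold, an object with k=1000 and 1050 fragments remaining (5%) would be rebuilt right away, while one with 1290 remaining (29%) could wait for a background pass.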

The wide spreading storage architecture can resist a higher number of storage device failures than current storage systems, e.g., a RAID storage system. For example, if the storage system 400 offers a storage resiliency of 30% and has a k of 1000, then the storage system 400 can resist a failure of “300” storage devices before the data is lost. So if one or more storage devices are lost, or even if an entire storage shelf/storage rack is lost, there may not be much impact on the storage resiliency. This provides a number of advantages. First, the rebuilding process may not have to be started immediately; it can be done at a later time. The storage resiliency of the lost data fragments can be repaired over time, e.g., when the work load (data read-write operations) on the storage system 400 is below a threshold, or when the current storage resiliency drops below the specified threshold, e.g., when the current storage resiliency is less than 10%, which means that more than “200” fragments of the data object have already been lost and the storage system 400 can tolerate failure of fewer than “100” more storage devices. That is, the wide spreading storage architecture can tolerate a longer mean time to repair, e.g., compared to RAID storage architecture.

Second, the wide spreading storage architecture separates the rebuilding of data fragments from replacement of the failed storage devices. That is, the storage system 400 may not have to wait until the failed storage devices are replaced to rebuild the data fragments. The rebuilding process reads the data fragments of the data object from the remaining storage devices, generates new data fragments as a function of the data fragments obtained from the other storage devices, and writes the new data fragments on one or more of the remaining storage devices. Accordingly, in the wide spreading storage architecture, the storage system 400 does not have to wait for the failed storage device to be replaced to rebuild the data fragments, unlike current storage architectures, e.g., RAID storage architecture without hot spares, where a failed storage device may have to be replaced immediately upon failure.

However, if the failed storage device is replaced immediately upon failure, the storage system 400 can use the replacement storage device as additional capacity, e.g., to store new data. Further, the replacement storage device can be of different storage capacity and/or type from that of the failed storage device.

The wide spreading storage architecture also minimizes the number of read-write operations required per storage device for rebuilding the data fragments of a particular data object. The regeneration module 428 obtains the remaining data fragments of the particular data object from other storage devices of the storage subsystem 306. Since the data fragments are spread over a number of storage devices, the read operations performed for the rebuilding process are spread across many storage devices and, therefore, the number of read operations performed on a particular storage device is limited. Further, in some embodiments, the regeneration module 428 obtains fewer than the remaining number of fragments, e.g., k fragments of the remaining fragments, to rebuild the lost data fragments, which further minimizes the read operations performed on the storage devices. By minimizing the read operations on a given storage device, the wear of the storage device is minimized and the lifespan of the storage device is therefore increased. Further, as a rebuild can be deferred and performed after many failures have occurred, rebuild operations are minimized compared to architectures where a rebuild is initiated for each failure.

Furthermore, after rebuilding the new data fragments, the new data fragments are written to a set of storage devices. In some embodiments, the set of storage devices to which the data is written is different from the set of storage devices from which the data fragments are read to rebuild the data fragments. Accordingly, the read-write operations performed on any given storage device are minimized, which minimizes the wear of the storage device and therefore increases the lifespan of the storage device.

As described above, the wide spreading storage architecture provides optimum storage resiliency to data stored in the storage devices of the storage subsystem 306 while minimizing the wear of the storage devices.

The wide spreading storage architecture can also be used to store metadata of the data object. FIG. 5 is a block diagram 500 for storing metadata of a data object with the data object in a storage system 400 of FIG. 4, consistent with various embodiments. The wide spreading storage architecture can provide the same storage resiliency to the metadata of a data object that is provided to the data object. Examples of metadata can include object ID, object size, object owner, creation time, created by, modified by, etc. The metadata can also include client-specified metadata, e.g., author of an object, name of entity, etc. Typically, current storage architectures store metadata separate from the data object. The wide spreading storage architecture enables storing the metadata with the data object, thereby eliminating the need to have a separate database for the metadata, the need to have specific infrastructure to ensure the metadata is consistent with the data, etc.

When a write request is received, the payload data in the write request is analyzed to obtain the metadata 510 and the data portion, e.g., data object 405. The data object 405 is then encoded, e.g., using encode/decode module 418 as described with reference to FIG. 4, to generate a number of fragments 505. The metadata 510 is combined with some or each of the fragments 505, e.g., concatenated or prefixed to each of the fragments 505, to generate composite fragments 515. The composite fragments 515 can then be stored in the storage subsystem 306 by spreading them across a number of storage devices, e.g., similar to storing the data fragments as described with reference to FIG. 4. In some embodiments, the metadata 510 can be a subset of the metadata of the data object 405.
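Prefixing the metadata 510 to a fragment can be sketched with a length-prefixed header. The layout used here, a 4-byte big-endian length followed by JSON-encoded metadata and then the fragment bytes, is an assumption for illustration; the text does not specify how the composite fragments 515 are serialized.

```python
import json
import struct


def make_composite(metadata, fragment):
    """Build a composite fragment: 4-byte header length, JSON metadata, fragment bytes."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + fragment


def split_composite(composite):
    """Recover (metadata, fragment) from a composite fragment."""
    (header_len,) = struct.unpack(">I", composite[:4])
    header = composite[4:4 + header_len]
    return json.loads(header.decode("utf-8")), composite[4 + header_len:]


# Every fragment of the object carries the same metadata copy.
metadata = {"object_id": "obj-42", "owner": "client-312a", "size": 11}
fragments = [b"hello", b" worl", b"d"]
composites = [make_composite(metadata, f) for f in fragments]
```

Since each composite is self-describing, any single surviving fragment is enough to read the metadata back, and a fragment can be moved to another storage device without touching a separate metadata store.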

In some embodiments, by including the metadata 510 with the data object, the possibility of inconsistency between the metadata 510 and the data object 405 is eliminated. Further, since the metadata 510 is attached to the fragments 505, the composite fragments 515 can be moved across locations/storage devices without having to update the metadata 510 and without risking the consistency between the metadata 510 and the data object 405.

Another benefit of storing the metadata 510 with the data object 405 is that, since a separate database and/or metadata server is not needed to maintain the metadata 510, the read and write operations are relatively faster because no separate read/write is required to read/write the metadata 510. In some embodiments, metadata retrieval is also simplified since a method call that is used for retrieving the data object 405 can be modified to retrieve the metadata 510, which can simplify a number of functions performed related to the metadata 510.

FIG. 6 is a flow diagram of a process 600 of storing data to an object-based storage system using wide spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 600 may be implemented in environment 300 of FIG. 3, and using the storage system 400 of FIG. 4. The process 600 begins at block 605, and at block 610, a request module 416 of the front-end subsystem 310 receives a write request including payload data. In some embodiments, the payload data includes a data portion and metadata of the data. If the data portion is not in a format suitable for storing in an object storage system, e.g., storage subsystem 306, the front-end subsystem 310 converts the data portion to the suitable format, e.g., as the data object.

At block 615, the encode/decode module 418 encodes the data object to generate a number of encoded data fragments, e.g., encoded data fragments F1-FN. In some embodiments, the encode/decode module 418 encodes the data object based on an erasure coding technique. The number of encoded data fragments generated can be expressed as a function, e.g., n=k+m, where variable “k” is the original amount of data fragments or the minimum number of data fragments required to regenerate or rebuild the data object, and variable “m” is the number of extra or redundant fragments added to provide protection from storage device failures. The variable “n” is the total number of fragments created after the encoding process.

After the encoded data fragments are generated, a mapping of the object identifier of the data object to the fragment identifiers of the encoded data fragments is stored in the mapping structure 414.

In some embodiments, apart from encoding the data object to generate the fragments, various other processes may be performed on the data object, e.g., deduplication, compression, encryption. One or more of these processes can be performed by the storage processing module.

At block 620, the storage layout module 420 determines a storage layout for storing the encoded data fragments across a number of storage devices, e.g., storage devices of storage subsystem 306. In some embodiments, the storage layout module 420 is configured to spread the encoded data fragments across as many storage devices as possible, e.g., to provide better storage resiliency to the data object. That is, the storage layout module 420 attempts to identify different storage devices for storing different encoded data fragments. In some embodiments, the storage layout module 420 selects the storage devices on a random basis. In some embodiments, the storage layout module 420 selects the storage devices on a random weighted basis.

At block 625, the transceiver module 432 transmits the encoded data fragments to the identified storage devices. For example, the transceiver module 432 can transmit the encoded data fragments to the storage shelves and/or the storage racks which contain the storage devices.

At block 630, the storage shelves and/or the storage racks store the encoded data fragments at the identified storage devices, and the process 600 returns. In some embodiments, the front-end subsystem 310 also stores the metadata of the data object with the data object. Additional details with respect to the process of storing the metadata are described at least with reference to FIGS. 9 and 10.

FIG. 7 is a flow diagram of a process 700 of reading data from an object-based storage system using wide spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 700 may be implemented in environment 300 of FIG. 3, and using the storage system 400 of FIG. 4. The process 700 begins at block 705, and at block 710, a request module 416 of the front-end subsystem 310 receives a read request, e.g., from a client system 312 a, for obtaining a data object. In some embodiments, the read request includes an object identifier of the data object.

At block 715, the fragment/segment identification module 422 determines the encoded data fragments of the data object using the object identifier. In some embodiments, a mapping of the object identifier to the fragment identifiers of the encoded data fragments is stored in the mapping structure 414.

At block 720, the storage layout module 420 determines the storage layout of the encoded data fragments using the mapping obtained from the mapping structure. The storage layout can include identification information of the storage devices where each of the encoded data fragments is stored. In some embodiments, the storage layout information can also include identification information of the storage racks and/or storage shelves of the storage devices where the encoded data fragments are stored.

At block 725, the transceiver module 432 obtains, from the identified storage devices, a sufficient number of the encoded data fragments required to generate the data object. In some embodiments, the sufficient number of encoded data fragments is k of the encoded data fragments. In some embodiments, the transceiver module 432 can obtain from k to n fragments. For example, the transceiver module 432 can stop fetching the fragments after obtaining the first k fragments. In another example, the transceiver module 432 can fetch all the n fragments but use only the first k fragments for regenerating the data object.

Further, the transceiver module 432 can preferentially select a subset of the identified storage devices to obtain the fragments from. The transceiver module 432 can select a storage device based on a number of factors, e.g., read latency of a storage device, type of the storage device, number of pending read requests ahead of the current read request in a read request queue of the storage device, a geographical location of the storage device. In some embodiments, the transceiver module 432 can obtain the fragments from different storage devices in parallel.

After obtaining the encoded data fragments, at block 730, the encode/decode module 418 decodes the encoded data fragments, e.g., based on the erasure coding method used to encode the data object, to generate the data object.

At block 735, the transceiver module 432 transmits the data object in response to the read request, e.g., to the client system 312 a, and the process 700 returns. In some embodiments, additional processes may be performed before decoding the data fragments. For example, the storage processing module 430 can decrypt the encoded data fragments if they were encrypted before being stored. In some embodiments, additional processes may be performed on the decoded data object before returning the data object to the client 312 a. For example, the storage processing module 430 can perform decompression and de-deduplication on the decoded data object if the data object was deduplicated and compressed.

FIG. 8 is a flow diagram of a process 800 of rebuilding data fragments of a data object in wide spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 800 may be implemented in environment 300 of FIG. 3, and using the storage system 400 of FIG. 4. In some embodiments, the data fragments stored in the storage subsystem 306 may be lost due to a failure of a storage device. The process 800 begins at block 805, and at block 810, a failure detection module 424 of the front-end subsystem 310 detects a failure of a storage device, e.g., storage device 304. In some embodiments, the failure can be one or more of the storage device being not accessible, the storage device being physically damaged, etc.

At block 815, the fragment/segment identification module 422 identifies the encoded data fragments that were stored at the storage device. For example, the fragment/segment identification module 422 can refer to the storage layout module 420 to determine the fragments stored at the storage device that has failed. Further, the fragment/segment identification module 422 identifies the one or more data objects corresponding to the identified encoded data fragments. For example, the fragment/segment identification module 422 can refer to the mapping structure 414 to determine the data objects associated with the identified encoded data fragments.

At block 820, the regeneration module 428 rebuilds some or all of the encoded data fragments that were stored at the storage device that failed. In some embodiments, rebuilding the data fragments includes performing the method described in association with blocks 821-824 for each of the identified data objects. At block 821, the regeneration module 428 computes the current storage resiliency of the data object. In some embodiments, storage resiliency is defined as a resistance to loss of one or more storage devices storing a portion of a data object or resistance to loss of one or more portions of the data object. In some embodiments, a current storage resiliency of a data object is determined as a function of the number of fragments remaining out of “n” fragments and “k.” For example, if n is “130” and k is “100,” then the number of redundant fragments, m, is “30,” and therefore the storage resiliency can be calculated as 30% (100*m/k). Note that the storage resiliency can be calculated using other functions and based on several other parameters.

The storage system 400 may guarantee a storage resiliency range to the clients of the storage system, for example, a minimum storage resiliency and a maximum storage resiliency. In some embodiments, the storage resiliency range is part of the SLO guaranteed to the clients. In some embodiments, the storage system 400 may not rebuild the lost data fragments until the current storage resiliency of the data object drops below the minimum storage resiliency.

At determination block 822, the regeneration module 428 determines if the current storage resiliency of the data object is less than the minimum storage resiliency. Continuing with the above example of a storage resiliency of 30%, if the minimum storage resiliency is 10%, then the storage system 400 can withstand loss of “20” data fragments, in which case m is “10.”
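The arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration that assumes the 100*m/k formula from the example above; the function names are hypothetical and not part of the disclosed system.

```python
def storage_resiliency(remaining_fragments: int, k: int) -> float:
    """Return resiliency as a percentage, using the 100*m/k formula."""
    m = remaining_fragments - k  # redundant fragments still available
    return 100.0 * m / k


def needs_rebuild(remaining_fragments: int, k: int, min_resiliency: float) -> bool:
    """True when the current resiliency has dropped below the minimum."""
    return storage_resiliency(remaining_fragments, k) < min_resiliency


# n = 130 and k = 100, so m = 30 and the resiliency is 30%.
full = storage_resiliency(130, 100)
# After losing 20 fragments, m = 10 and the resiliency equals the 10% minimum,
# so a rebuild is not yet triggered.
at_minimum = needs_rebuild(110, 100, 10.0)
# Losing one more fragment drops the resiliency below the minimum.
below_minimum = needs_rebuild(109, 100, 10.0)
```

With these inputs, a rebuild is triggered only once more than 20 of the 130 fragments have been lost.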

Responsive to a determination that the current storage resiliency of the data object is not less than the minimum storage resiliency, the process 800 returns. On the other hand, responsive to a determination that the current storage resiliency is less than the minimum storage resiliency, at block 823, the transceiver module 432 obtains a sufficient number of fragments of the data object from the remaining storage devices. The transceiver module 432 may use the storage layout to identify the storage devices that store the data fragments of the data object. In some embodiments, the transceiver module 432 can obtain the minimum number of fragments required to rebuild the data fragments.

At block 824, the regeneration module 428 regenerates the data fragments as a function of the obtained data fragments and stores the regenerated data fragments in at least a subset of the remaining storage devices. In some embodiments, the regeneration module 428 regenerates as many data fragments as required to meet a specified storage resiliency, which can be up to the maximum storage resiliency. In some embodiments, regenerating the data fragments as a function of the obtained data fragments includes encoding the obtained data fragments to generate the new/replacement/additional data fragments. In some embodiments, regenerating the data fragments as a function of the obtained data fragments includes decoding the obtained data fragments to generate the data object and encoding the generated data object to generate the specified number of data fragments.
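The number of fragments to regenerate at block 824 follows from the target storage resiliency. The sketch below assumes the 100*m/k formula used in the example for block 821; the helper name and the rounding choice (rounding the target m up) are assumptions made for illustration.

```python
import math


def fragments_to_regenerate(remaining: int, k: int, target_resiliency: float) -> int:
    """Return how many fragments must be regenerated to reach the target.

    remaining: fragments of the data object that are still readable.
    k: minimum number of fragments needed to rebuild the data object.
    target_resiliency: desired resiliency in percent (100*m/k).
    """
    target_m = math.ceil(k * target_resiliency / 100.0)  # redundant fragments wanted
    current_m = remaining - k                            # redundant fragments present
    return max(0, target_m - current_m)


# 109 of 130 fragments remain (k = 100); restoring 30% resiliency needs m = 30.
count = fragments_to_regenerate(109, 100, 30.0)
```

The target can be any value between the minimum and the maximum storage resiliency, so a system could choose to restore only part of the lost redundancy.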

FIG. 9 is a flow diagram of a process 900 of storing metadata of a data object with the data object in wide spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 900 may be implemented in environment 300 of FIG. 3, and using the storage system 400 of FIG. 4. The process 900 begins at block 905, and at block 910, a request module 416 of the frontend subsystem 310 receives a write request including payload data. In some embodiments, the payload data includes a data portion and metadata of the data. If the data portion is not in a format suitable for storing in an object storage system, e.g., storage subsystem 306, the frontend subsystem 310 converts the data portion to the suitable format, e.g., as the data object.

At block 915, the metadata processing module 426 analyzes the payload data to obtain the metadata of the data object, e.g., metadata 510 of FIG. 5. Examples of metadata can include object ID, object size, object owner, creation time, created by, modified by, etc. The metadata can also include client-specified metadata, e.g., author of an object, name of entity, etc.

At block 920, the encode/decode module 418 encodes the data object to generate a number of encoded data pieces, e.g., segments and/or fragments. In some embodiments, the encode/decode module 418 encodes the data object as described at least with reference to FIGS. 4-6.

At block 925, after the encoded data pieces are generated, the metadata processing module 426 processes the encoded data pieces and the metadata for storage across a number of storage devices, e.g., storage devices of the storage subsystem 306, and the process 900 returns. Additional details with respect to the method of processing the metadata are described at least with reference to FIG. 10.

FIG. 10 is a flow diagram of a process 1000 of processing metadata and data fragments of a data object in wide spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1000 may be implemented in environment 300 of FIG. 3, and using the storage system 400 of FIG. 4. In some embodiments, the process 1000 implements the method of block 925 of FIG. 9. The data piece generated in the process 900 of FIG. 9, e.g., in block 920, can be considered as a data fragment in the wide spreading storage architecture. The process 1000 begins at block 1005, and at block 1010, the metadata processing module 426 combines each of the data fragments of the data object with the metadata, e.g., metadata 510, to generate composite encoded data fragments, e.g., composite encoded data fragments 515. In some embodiments, combining the metadata with each of the fragments includes concatenating or prefixing the metadata to each of the fragments.

After the composite fragments are generated, at block 1015, the transceiver module 432 transmits the composite fragments to the storage subsystem 306 for storing across a number of storage devices, e.g., similar to storing the data fragments as described at least with reference to blocks 620-630 of FIG. 6, and the process 1000 returns. Prior to transmitting the composite fragments to the storage subsystem 306, the storage layout module 420 determines a storage layout for storing the composite data fragments across the number of storage devices, e.g., similar to determining the storage layout for storing the data fragments as described at least with reference to FIG. 4 and block 620 of FIG. 6. The transceiver module 432 then transmits the composite data fragments to the identified storage devices.
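Prefixing the metadata to each fragment, as in block 1010, might be sketched as follows. The on-disk format here (a 4-byte big-endian length followed by JSON-encoded metadata) is purely an assumption for illustration; the description only states that the metadata is concatenated or prefixed, and the function names are hypothetical.

```python
import json
import struct
from typing import Dict, List, Tuple


def make_composite(fragment: bytes, metadata: Dict[str, str]) -> bytes:
    """Prefix length-delimited metadata to one encoded fragment."""
    meta = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(meta)) + meta + fragment


def split_composite(composite: bytes) -> Tuple[Dict[str, str], bytes]:
    """Recover the metadata and the original fragment from a composite."""
    (meta_len,) = struct.unpack(">I", composite[:4])
    meta = json.loads(composite[4:4 + meta_len].decode("utf-8"))
    return meta, composite[4 + meta_len:]


def make_composites(fragments: List[bytes], metadata: Dict[str, str]) -> List[bytes]:
    """Attach the same metadata to every fragment of a data object."""
    return [make_composite(f, metadata) for f in fragments]


metadata = {"object_id": "obj-1", "owner": "alice"}  # example values only
composites = make_composites([b"frag-1", b"frag-2"], metadata)
recovered_meta, recovered_fragment = split_composite(composites[0])
```

Because every composite fragment carries its own copy of the metadata, the metadata can be read back from any single fragment without consulting a separate store.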

FIG. 11 is a block diagram of storage system 1100 implementing hierarchical spreading storage architecture, consistent with various embodiments. In some embodiments, the storage system 1100 can be implemented in the environment 300 of FIG. 3. Further, in some embodiments, the storage system 1100 includes at least some of the characteristics, behavior/functionalities of the storage system 400 of FIG. 4. In some embodiments, the wide spreading storage architecture of storage system 400 can also be implemented in the storage system 1100. The storage system 1100 includes the front-end subsystem 310 and a tier of hierarchical storage nodes, e.g., hierarchical storage nodes 314-318 that facilitate data storage and retrieval from the storage subsystem 306, which includes storage shelves 306 a-n. The hierarchical storage nodes can be implemented in a similar configuration to that of the front-end subsystem 310. For example, a hierarchical storage node can include the modules/components of the front-end subsystem 310 depicted in FIG. 3. Note that although FIG. 11 depicts one tier of hierarchical storage nodes, the hierarchical spreading storage architecture can have more than one tier of hierarchical storage nodes.

Each of the hierarchical storage nodes 314-318 can be associated with a set of storage devices. For example, the hierarchical storage node 314 is associated with storage devices from storage shelves 306 a and 306 b, the hierarchical storage node 316 is associated with storage devices from storage shelf 306 c, and the hierarchical storage node 318 is associated with storage devices from storage shelves 306 d and 306 e. In some embodiments, the hierarchical storage nodes are spread across various geographical locations. In other embodiments, the hierarchical storage nodes are integrated into each storage shelf.

The following paragraphs describe additional details of writing data to the storage subsystem 306 in hierarchical spreading storage architecture.

When a client, e.g., client 312 a, sends a write request to the storage system 1100, the request module 416 receives the request and extracts the data object to be written from the request. The encode/decode module 418 encodes the data object to generate a number of segments, e.g., “S1,” “S2,” and “S3”. In some embodiments, the encode/decode module 418 can use wide spreading, or an erasure coding method directly, e.g., Reed-Solomon, FEC coding, Fountain code, Raptor code, Tornado code, to generate the segments. In some embodiments, the number of segments generated is a function of the number of hierarchical storage nodes.

The transceiver module 432 distributes the data segments to a number of hierarchical storage nodes, e.g., hierarchical storage nodes 314-318. The storage layout module 420 determines the storage layout of the segments, that is, the hierarchical storage nodes to which the segments have to be distributed, and the transceiver module 432 spreads the segments to the identified hierarchical storage nodes. In some embodiments, the storage layout module 420 is configured to select different hierarchical storage nodes for different segments, e.g., to maximize storage resiliency of the data object. However, in some embodiments, more than one segment may be transmitted to a hierarchical storage node. In some embodiments, the storage layout module 420 determines the hierarchical storage nodes to which the segments have to be distributed on a random basis. The storage layout can also be specified by a user, e.g., an administrator of the storage system 1100. In FIG. 11, the segment “S1” is sent to the hierarchical storage node 314, the segment “S2” is sent to the hierarchical storage node 316 and the segment “S3” is sent to the hierarchical storage node 318. In some embodiments, the segments are transmitted to the hierarchical storage nodes in parallel.

The number of segments generated by the encode/decode module 418 can also depend on the required storage resiliency. The storage resiliency offered can be represented as n′=k′+m′, where variable k′ is the original amount of data segments or the minimum number of data segments required to rebuild the data object, and variable m′ stands for the extra or redundant segments added to provide protection from failures, e.g., failures of hierarchical storage nodes and/or storage devices associated with hierarchical storage nodes. The variable n′ is the total number of segments created after the encoding process.

The segment identifiers of the data object may be stored in the fragment namespace 412. The mapping structure 414 can store a mapping of the object identifier of the data object to the segment identifiers of the segments of the data object.

In some embodiments, prior to encoding the data object, the storage processing module 430 can perform a number of storage efficiency processes on the data object, e.g., as described at least with reference to FIG. 4.

Each of the hierarchical storage nodes 314-318 can encode, independent of the other hierarchical storage nodes, the segment, e.g., based on an erasure coding method, to generate a number of fragments of the segment. In some embodiments, the hierarchical storage node encodes the segment using an encode/decode module similar to the encode/decode module 418. In FIG. 11, the segments “S1,” “S2,” and “S3,” are each encoded to generate eight fragments F1-F8. Each of the hierarchical storage nodes stores the fragments, F1 to F8, across the storage devices of the storage subsystem 306. In some embodiments, the techniques involved in encoding a data segment to generate the fragments of a segment and storing the fragments across the storage devices are similar to the techniques involved in encoding a data object to generate the fragments of the data object and storing the fragments across the storage devices in wide spreading storage architecture, e.g., as described at least with reference to FIGS. 4 and 6.

For storing the fragments across a set of storage devices, the hierarchical storage node determines a storage layout of the fragments. The storage layout identifies one or more of the storage racks, storage shelves of a rack and storage devices of a storage shelf in which the data fragments have to be stored. In some embodiments, the hierarchical storage node determines the storage layout of the fragments using a storage layout module similar to the storage layout module 420. After the storage layout is determined, the hierarchical storage node stores the fragments in the identified storage devices. In some embodiments, the hierarchical storage node writes the fragments to the different storage devices in parallel. In the hierarchical spreading storage architecture, the writes are more efficient than in current storage systems. For example, in addition to writing the fragments of a particular segment in parallel, all the hierarchical storage nodes can write the fragments of their corresponding segments in parallel.
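The parallel write path of a single hierarchical storage node might look like the following sketch. It assumes in-memory dictionaries as stand-ins for storage devices and a hypothetical tuple-based representation of the storage layout; an actual node would issue device or network I/O instead.

```python
from concurrent.futures import ThreadPoolExecutor


def write_fragment(device: dict, fragment_id: str, data: bytes) -> str:
    """Write one fragment to one (simulated) storage device."""
    device[fragment_id] = data
    return fragment_id


def store_fragments(layout, devices):
    """Write all fragments of a segment in parallel.

    layout: list of (fragment_id, device_id, data) tuples, i.e., the storage
    layout chosen for this segment (hypothetical representation).
    devices: mapping of device_id to a dict that simulates the device.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(layout))) as pool:
        futures = [
            pool.submit(write_fragment, devices[device_id], fragment_id, data)
            for fragment_id, device_id, data in layout
        ]
        # Wait for every write so that failures surface to the caller.
        return [f.result() for f in futures]


devices = {f"dev{i}": {} for i in range(8)}
layout = [(f"F{i + 1}", f"dev{i}", bytes([i])) for i in range(8)]
written = store_fragments(layout, devices)
```

Each hierarchical storage node could run the same routine independently for its own segment, which is how the writes of different segments also proceed in parallel.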

The hierarchical storage node stores the segment identifier of the data segment and the fragment identifiers of the fragments of the data segment in a staging area similar to the staging area 408. Further, the hierarchical storage node stores a mapping of the segment identifier of a segment to the fragment identifiers of the segment in a mapping structure similar to the mapping structure 414.

In the hierarchical spreading storage architecture, the storage resiliency provided for a data object is split across the tiers of a storage system. For example, if the storage resiliency offered for a data object by the storage system 1100 is 30%, then the first tier, i.e., the hierarchical storage nodes 314-318, provides 15% of the storage resiliency and the second tier, i.e., the storage devices, provides the other 15%. The amount of storage resiliency provided by each of the tiers can be configurable. However, the sum of storage resiliencies offered by the tiers may not exceed the total storage resiliency offered by the storage system 1100.

Referring to the read requests, when a read request arrives at the storage system 1100 from the client 312 a for a particular data object, the data object can be reconstructed by obtaining at least k′ number of the n′ data segments and decoding them to regenerate the data object. The transceiver module 432 obtains the storage layout of the segments from the storage layout module 420 and obtains the data segments from the identified hierarchical storage nodes. The storage layout module 420 can obtain the segment identifiers of the segments of the data object from the mapping structure 414 and then determine from the storage layout the hierarchical storage nodes at which the corresponding segments are stored.

After the hierarchical storage nodes are identified, the transceiver module 432 requests the hierarchical storage nodes to return the data segments of the data object. The transceiver module 432 can obtain k′ to n′ number of segments for generating the data object. For example, the transceiver module 432 can stop fetching the segments after obtaining the first k′ segments. In another example, the transceiver module 432 can fetch all the n′ segments but use only the first k′ segments for regenerating the data object. Further, the transceiver module 432 can preferentially select a subset of the identified hierarchical storage nodes to obtain the segments from. The transceiver module 432 selects a hierarchical storage node based on a number of factors, e.g., a latency of the hierarchical storage node, a workload of the hierarchical storage node, a geographical location of the storage device. In some embodiments, the transceiver module 432 can obtain the segments from different storage nodes in parallel.
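The "stop after the first k′ segments" strategy can be sketched with parallel requests. In this illustration, each hierarchical storage node is represented by a hypothetical zero-argument callable that returns the segment bytes or raises on failure; the function name and error handling are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_first_k(node_fetchers, k_prime):
    """Request all segments in parallel and return the first k_prime to arrive.

    node_fetchers: dict mapping segment_id to a zero-argument callable that
    fetches that segment from its hierarchical storage node (hypothetical).
    """
    segments = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(node_fetchers)))
    futures = {pool.submit(fn): seg_id for seg_id, fn in node_fetchers.items()}
    try:
        for fut in as_completed(futures):
            try:
                segments[futures[fut]] = fut.result()
            except Exception:
                continue  # node failed; rely on the remaining segments
            if len(segments) >= k_prime:
                break  # enough segments to decode the data object
    finally:
        # Drop requests that have not started yet (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    if len(segments) < k_prime:
        raise RuntimeError("fewer than k' segments could be obtained")
    return segments


def failing_node():
    raise IOError("node unavailable")


fetchers = {"S1": failing_node, "S2": lambda: b"seg-2", "S3": lambda: b"seg-3"}
obtained = fetch_first_k(fetchers, k_prime=2)
```

In this usage, the node holding S1 is unavailable, yet the read still succeeds because k′ = 2 segments are obtained from the other two nodes.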

When a particular hierarchical storage node receives a request from the front-end subsystem 310 for a data segment, the hierarchical storage node obtains the fragments of the data segment from the storage devices associated with the hierarchical storage node. The hierarchical storage node determines the storage layout of the fragments and obtains a sufficient number of the data fragments, e.g., the minimum number of data fragments required to generate the data segment, from the identified storage devices.

Further, the hierarchical storage node can preferentially select a subset of the storage devices to obtain the fragments from. The hierarchical storage node selects a storage device based on a number of factors, e.g., read latency of the storage device, type of the storage device, number of pending read requests ahead of the current read request in a read request queue of the storage device, distance to the storage device. Accordingly, the hierarchical storage node may not even read some of the storage devices that contain the data fragments of the data object, thereby minimizing read/write operations on a particular storage device. In some embodiments, the hierarchical storage node can obtain the fragments in parallel.

After obtaining the data fragments, the hierarchical storage node decodes the data fragments, e.g., based on the erasure coding used to encode the data segment, to generate the data segment, and then returns the data segment to the front-end subsystem 310. In some embodiments, the hierarchical storage node may perform additional processes on the decoded data segment before returning it to the front-end subsystem 310. For example, the hierarchical storage node can perform decompression and de-deduplication on the decoded data segment if the data segment was deduplicated and compressed.

After the front-end subsystem 310 obtains a sufficient number of the data segments from the hierarchical storage nodes, the front-end subsystem 310 decodes the data segments to generate the data object, and returns the data object to the client system 312 a. In some embodiments, the storage processing module 430 may perform additional processes on the decoded data object before returning the data object to the client 312 a. For example, the storage processing module 430 can perform decompression and de-deduplication on the decoded data object if the data object was deduplicated and compressed.

As described above, the hierarchical spreading storage architecture distributes the storage resiliency provided to the data across the storage tiers—hierarchical storage nodes 314-318 and storage devices of the storage subsystem 306. One of the advantages of such a distributed storage resiliency is that the storage system 1100 can withstand the loss of either some of the hierarchical storage nodes or some of the storage devices of a hierarchical storage node, or in some cases, both.

Another advantage of the hierarchical spreading storage architecture is that the rebuilding process can be localized in some cases. That is, when a storage device associated with a particular hierarchical storage node fails, the data fragments of a segment stored at the failed storage device may be rebuilt using the remaining data fragments of the segment stored within the storage shelves of the particular hierarchical storage node. The storage system 1100 may not have to obtain the fragments from the storage devices associated with another hierarchical storage node. For example, when a fragment F1 of the segment S1 is lost due to a failure of a storage device in the storage shelves 306 a-b, the hierarchical storage node rebuilds a new data fragment for the data segment S1 using the remaining data fragments, F2-F8, stored at other storage devices within the storage shelves 306 a-b. In some embodiments, the hierarchical storage node uses a sufficient number of the data fragments, e.g., k of the remaining data fragments, to rebuild the new data fragment. The hierarchical storage node can use the encoding method used to generate the initial fragments to regenerate the new data fragment.
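A localized rebuild can be illustrated with a deliberately simple code. The sketch below substitutes a single XOR parity fragment (seven data fragments plus one parity fragment, giving eight fragments in total) for the erasure coding method; the description does not state which code or which k is used for the eight fragments of FIG. 11, so both are assumptions, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_parity(data_fragments):
    """Return the data fragments followed by one XOR parity fragment."""
    parity = reduce(xor_bytes, data_fragments)
    return list(data_fragments) + [parity]


def rebuild_lost(fragments, lost_index):
    """Rebuild one lost fragment from the surviving fragments of the same segment.

    Only fragments held by this hierarchical storage node are read; no other
    node is contacted.
    """
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    return reduce(xor_bytes, survivors)


# Seven data fragments of segment S1 plus one parity fragment: F1..F8.
data = [bytes([i] * 4) for i in range(1, 8)]
stored = encode_with_parity(data)
# The device holding F1 fails; rebuild F1 from F2..F8.
new_f1 = rebuild_lost(stored, lost_index=0)
```

A single parity fragment tolerates only one loss per segment; an erasure code with more redundant fragments would tolerate more, but the rebuild would remain local to the node in the same way.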

Localizing the rebuilding process to a particular hierarchical storage node minimizes the network traffic, e.g., between the hierarchical storage nodes and the front-end subsystem 310, between the hierarchical storage nodes, that might otherwise occur if the fragments are to be read from storage devices apart from that of the particular hierarchical storage node. This saves the time required for the fragments to traverse the network and therefore, can make the rebuilding process faster and more efficient. Further, by localizing the rebuilding process to the storage devices of the particular hierarchical storage node, the read-write operations performed on storage devices of other hierarchical storage nodes are minimized, and therefore the wear of other storage devices is minimized.

The hierarchical storage node can rebuild the data fragments of all the data segments whose storage resiliency is affected or a subset of those data segments. In some embodiments, the hierarchical storage node rebuilds the data fragments for a particular data segment if the current storage resiliency of the data segment is below the minimum storage resiliency to be provided for the data segment, e.g., as described with reference to rebuilding the data fragments in FIGS. 4 and 8.

However, when a particular hierarchical storage node fails or a current storage resiliency of a data segment stored by the particular hierarchical storage node drops below the minimum storage resiliency, the storage system 1100 uses the fragments from other hierarchical storage nodes to rebuild the lost fragments. For example, when the hierarchical storage node 314 fails, the front-end subsystem 310 obtains all or some of the remaining segments S2 and S3 from the remaining hierarchical storage nodes, generates a new segment S4 (not illustrated) and transmits it to another hierarchical storage node or one of the hierarchical storage nodes 316 and 318, which further encodes the new segment into fragments and stores them at its associated storage devices.

The hierarchical spreading storage architecture can also be used to store metadata of the data received from a client of the storage system 1100. FIG. 12 is a block diagram 1200 for storing metadata of a data object with the data object in a storage system 1100 of FIG. 11, consistent with various embodiments. The hierarchical spreading storage architecture can provide the same storage resiliency to the metadata of a data object that is provided to the data object. Examples of metadata can include object ID, object size, object owner, creation time, created by, modified by, client-specified metadata, etc. Typically, metadata is stored separate from the data object. The hierarchical spreading storage architecture enables storing the metadata with the data object, thereby eliminating the need to have a separate database for metadata, the need to have specific infrastructure in place to ensure the metadata is consistent with the data, etc.

When a write request is received at the storage system 1100, the payload data in the write request is analyzed to obtain the metadata 510 and the data portion, e.g., data object 405. The data object 405 is then encoded, e.g., using encode/decode module 418, to generate a number of segments 1205, e.g., as described with reference to FIG. 11. The metadata 510 is combined with each of the segments 1205, e.g., concatenated or prefixed to each of the segments 1205, to generate composite segments 1210. In some embodiments, the metadata 510 can be a subset of the metadata of the data object 405. The composite segments 1210 can then be sent to a number of hierarchical storage nodes, e.g., as described with reference to FIG. 11 for further storage at a set of storage devices associated with the hierarchical storage nodes.

When a particular hierarchical storage node receives a composite data segment, it encodes the composite data segment to generate a number of data fragments such as fragments 1215. The metadata 510 is combined with each of the fragments 1215, e.g., concatenated or prefixed to each of the fragments 1215, to generate composite fragments 1220. The composite fragments 1220 can then be stored at the storage devices associated with the hierarchical storage node, e.g., as described with reference to FIG. 11.

Note that though FIG. 12 illustrates combining metadata 510 with both the data segments and the fragments, the metadata 510 can be combined with either the data segments or the data fragments.

In some embodiments, by storing the metadata 510 with the data object 405, the possibility of inconsistency between the metadata 510 and the data object 405 is eliminated. Further, since the metadata 510 is attached to the segments 1205 and/or fragments 1215, the composite segments 1210 can be moved across hierarchical storage nodes and the composite fragments 1220 can be moved across storage devices without having to update the metadata 510 and without risking the consistency between the metadata 510 and the data object 405.

In some embodiments, another benefit of storing the metadata 510 with the data object 405 is that, since a separate database and/or metadata server is not needed to maintain the metadata 510, the read and write operations are relatively faster because no separate read/write is required to read/write the metadata 510. In some embodiments, metadata retrieval is also simplified since a method call that is used for retrieving the data object 405 can be modified to retrieve the metadata 510, which can simplify a number of functions performed related to the metadata 510.

FIG. 13 is a flow diagram of a process 1300 of storing data to an object-based storage system using hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1300 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. The process 1300 begins at block 1305, and at block 1310, a request module 416 of the frontend subsystem 310 receives a write request including payload data. In some embodiments, the payload data includes a data portion and metadata of the data. If the data portion is not in a format suitable for storing in an object storage system, e.g., storage subsystem 306, the frontend subsystem 310 converts the data portion to the suitable format, e.g., as the data object.

At block 1315, the encode/decode module 418 encodes the data object to generate a number of encoded data segments, e.g., encoded data segments S1-S3. In some embodiments, the encode/decode module 418 encodes the data object based on an erasure coding technique. The number of encoded data segments generated can be expressed as a function, e.g., n′=k′+m′, where variable k′ is the original amount of data segments or the minimum number of data segments required to regenerate or rebuild the data object, and variable m′ stands for the extra or redundant segments that are added to provide protection from storage device/storage node failures. The variable n′ is the total number of segments created after the encoding process.

After the encoded data segments are generated, a mapping of the object identifier to the segment identifiers of the encoded data segments is stored in the mapping structure 414 in the staging area 408.

In some embodiments, apart from encoding the data object to generate the segments, various other storage efficiency processes may be performed on the data object, e.g., deduplication, compression, encryption. One or more of these processes can be performed by the storage processing module 430.

At block 1320, the storage layout module 420 determines a storage layout for sending the encoded data segments across a number of hierarchical storage nodes, e.g., hierarchical storage nodes 314-318. In some embodiments, the storage layout module 420 is configured to spread the encoded data segments across as many hierarchical storage nodes as possible, e.g., to provide better storage resiliency to the data object. That is, the storage layout module 420 attempts to identify different hierarchical storage nodes for storing different encoded data segments. In some embodiments, the storage layout module 420 selects the hierarchical storage nodes on a random basis. In some embodiments, the storage layout module 420 selects the hierarchical storage nodes on a random weighted basis. In some embodiments, the random weighted basis attempts to store the data segments evenly across the hierarchical storage nodes. For example, one type of weighting is to decrease the weight if there are already a specified number of segments stored at the hierarchical storage node. In some embodiments, the random weighted basis randomly identifies the hierarchical storage nodes at which the encoded data segments are to be stored as a function of decreasing the risk of data loss. For example, if a particular geographical region is prone to higher number of device failures, then the storage nodes in that geographical region may be weighted less so that a lower number of segments are written to the storage nodes in that geographical region.
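A random weighted selection of hierarchical storage nodes, as described for block 1320, might be sketched as follows. The weights, the halving of a node's weight after it receives a segment, and the preference for nodes that have not yet been chosen are all assumptions chosen to illustrate the idea; the function name is hypothetical.

```python
import random


def choose_nodes(node_weights, num_segments, rng=None):
    """Pick one hierarchical storage node per segment on a random weighted basis.

    node_weights: dict of node_id -> positive weight. A lower weight could
    reflect, e.g., a region that is prone to more device failures.
    """
    rng = rng or random.Random()
    weights = dict(node_weights)
    placement = []
    for _ in range(num_segments):
        # Prefer nodes that hold no segment of this object yet; reuse nodes
        # only when there are more segments than nodes.
        candidates = [n for n in weights if n not in placement] or list(weights)
        chosen = rng.choices(
            candidates, weights=[weights[n] for n in candidates], k=1
        )[0]
        placement.append(chosen)
        weights[chosen] *= 0.5  # assumption: decrease weight as a node fills up
    return placement


# Node 318 is assumed to sit in a higher-risk region, so it gets a lower weight.
nodes = {"314": 1.0, "316": 1.0, "318": 0.5}
placement = choose_nodes(nodes, num_segments=3, rng=random.Random(0))
```

With three segments and three nodes, each node receives exactly one segment, and the weights only influence the order of assignment; the weights matter more when segments outnumber nodes or when only a subset of nodes is used.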

At block 1325, the transceiver module 432 transmits the encoded data segments to the identified hierarchical storage nodes. For example, the transceiver module 432 can transmit the encoded data segments S1-S3 to hierarchical storage nodes 314-318, respectively.

At block 1330, each of the hierarchical storage nodes that receives an encoded data segment processes the encoded data segment to store it at a set of storage devices associated with the hierarchical storage node, and the process 1300 returns. The processing can include encoding the data segment to generate a number of data fragments (block 1331). For example, the hierarchical storage node 314 encodes the data segment to generate fragments F1-F8. In some embodiments, the hierarchical storage node encodes the data segment based on an erasure coding technique. Also, the erasure coding technique used to generate the data segments can be different from that used for generating the fragments from the segment.

The hierarchical storage node includes a storage layout module, e.g., similar to the storage layout module 420, that determines a storage layout for storing the data fragments at a set of storage devices associated with the hierarchical storage node (block 1332). In some embodiments, the storage layout module is configured to spread the encoded data fragments across as many storage devices as possible, e.g., to provide better storage resiliency to the data object. After the storage layout is determined, the hierarchical storage node stores the encoded data fragments at the identified storage devices (block 1333).

In some embodiments, the front-end subsystem 310 also stores the metadata of the data object with the data segments and/or fragments. Additional details with respect to the process of storing the metadata are described at least with reference to FIGS. 9 and 17.

FIG. 14 is a flow diagram of a process 1400 of reading data from an object-based storage system using hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1400 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. The process 1400 begins at block 1405, and at block 1410, a request module 416 of the frontend subsystem 310 receives a read request, e.g., from a client system 312 a, for obtaining a data object. In some embodiments, the read request includes an object identifier of the data object.

At block 1415, the fragment/segment identification module 422 determines the encoded data segments of the data object using the object identifier. In some embodiments, a mapping of the object identifier to the encoded data segments is stored in the mapping structure 414 in the staging area 408.

At block 1420, the storage layout module 420 determines the storage layout of the encoded data segments using the mapping obtained from the mapping structure 414. The storage layout can include identification information of the hierarchical storage nodes where each of the encoded data segments is stored.

At block 1425, the transceiver module 432 identifies the hierarchical storage nodes that store a sufficient number of the encoded data segments required to generate the data object. In some embodiments, the sufficient number of encoded data segments is k′ number of the encoded data segments. In some embodiments, the transceiver module 432 can obtain k′ to n′ number of segments. For example, the transceiver module 432 can stop fetching the segments after obtaining the first k′ segments. In another example, the transceiver module 432 can fetch all the n′ segments but use only the first k′ segments for regenerating the data object.

Further, the transceiver module 432 can preferentially select a subset of the identified hierarchical storage nodes from which to obtain the segments. The transceiver module 432 can select a hierarchical storage node based on a number of factors, e.g., a read latency of the hierarchical storage node, a type of the storage devices associated with the hierarchical storage node, a number of pending read requests ahead of the current read request in a read request queue of the hierarchical storage node, and a geographical location of the hierarchical storage node.
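
As an illustrative sketch only, the preferential selection described above can be expressed as a ranking of candidate nodes followed by taking the first k′ of them. The scoring rule (read latency multiplied by queue position) is an assumption chosen for simplicity; the specification lists the factors but does not prescribe how they are combined. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    node_id: str
    read_latency_ms: float  # observed read latency of the node
    pending_reads: int      # requests queued ahead of the current read


def estimated_wait_ms(node: NodeStats) -> float:
    """Assumed cost model: each queued request costs one read latency."""
    return node.read_latency_ms * (node.pending_reads + 1)


def select_nodes(candidates, k_prime):
    """Return the k' candidate nodes with the lowest estimated wait."""
    if len(candidates) < k_prime:
        raise ValueError("fewer than k' nodes hold segments of the object")
    return sorted(candidates, key=estimated_wait_ms)[:k_prime]


candidates = [
    NodeStats("A", read_latency_ms=5, pending_reads=10),
    NodeStats("B", read_latency_ms=20, pending_reads=0),
    NodeStats("C", read_latency_ms=8, pending_reads=1),
    NodeStats("D", read_latency_ms=50, pending_reads=3),
]
chosen = [n.node_id for n in select_nodes(candidates, k_prime=2)]
```

Other factors named above, such as device type or geographical location, could be folded into the same scoring function as additional weighted terms.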

After the hierarchical storage nodes are identified, the transceiver module 432 requests the data segment from each of the identified hierarchical storage nodes.

At block 1430, each of the identified hierarchical storage nodes performs a number of steps, e.g., 1431-1433, to obtain the data segment. At block 1431, the hierarchical storage node determines, from a storage layout of the fragments, the set of storage devices that store a sufficient number of the encoded data fragments required to generate the data segment. In some embodiments, the sufficient number of encoded data fragments is k number of the encoded data fragments. In some embodiments, the hierarchical storage node can obtain k to n number of fragments. For example, the hierarchical storage node can stop fetching the fragments after obtaining the first k fragments. In another example, the hierarchical storage node can fetch all the n fragments but use only the first k fragments for regenerating the data segment.

Further, the hierarchical storage node can preferentially select a subset of the identified storage devices from which to obtain the fragments. The hierarchical storage node can select a storage device based on a number of factors, e.g., a read latency of the storage device, a type of the storage device, a number of pending read requests ahead of the current read request in a read request queue of the storage device, and a geographical location of the storage device. At block 1432, the hierarchical storage node obtains the sufficient number of fragments from the identified set of storage devices.

At block 1433, after obtaining the encoded data fragments, the hierarchical storage node decodes the encoded data fragments, e.g., based on the erasure coding method used to encode the data segment, to generate the data segment. After generating the data segment, the hierarchical storage node returns the data segment to the front-end subsystem 310. In some embodiments, additional processes may be performed before decoding the data fragments. For example, the hierarchical storage node can decrypt the encoded data fragments if they were encrypted before being stored. In some embodiments, additional processes may be performed on the decoded data segment before the data segment is returned to the front-end subsystem 310. For example, the hierarchical storage node can perform decompression and de-deduplication on the decoded data segment if the data segment was deduplicated and compressed.
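
The two-tier read path of blocks 1425-1435 can be illustrated with the following minimal sketch. It substitutes a single-parity XOR code (n = k + 1, tolerating one lost piece) for the erasure coding method, purely to keep the example short; the specification does not name a particular code, and a production system would typically use a code with more redundancy. All function and variable names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def decode(pieces: dict, k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k of the k + 1 pieces (index -> bytes)."""
    if len(pieces) < k:
        raise ValueError("fewer than k pieces available")
    chunks = dict(pieces)
    missing = [i for i in range(k) if i not in pieces]
    if missing:
        # With k pieces present and one data chunk absent, the XOR of all
        # present pieces (data plus parity) restores the absent chunk.
        (lost,) = missing
        chunks[lost] = reduce(xor_bytes, pieces.values())
    return b"".join(chunks[i] for i in range(k))[:length]


K_SEG, K_FRAG = 2, 3  # k' for segments, k for fragments (illustrative values)
obj = b"wide spreading storage"

# Write path: object -> n' segments -> n fragments per hierarchical storage node.
segments = encode(obj, K_SEG)
seg_len = len(segments[0])
nodes = [encode(segment, K_FRAG) for segment in segments]


def node_read(fragments, lost_index):
    """Blocks 1431-1433: a node decodes its segment from k surviving fragments."""
    available = {i: f for i, f in enumerate(fragments) if i != lost_index}
    return decode(available, K_FRAG, seg_len)


# Read path: node 1 is unavailable, and each remaining node has lost one fragment.
recovered_segments = {i: node_read(nodes[i], lost_index=i) for i in (0, 2)}
result = decode(recovered_segments, K_SEG, len(obj))  # block 1435
```

In this sketch the object survives the loss of an entire node plus one storage device in each remaining node, because the redundancy at the segment tier and at the fragment tier is applied independently.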

After obtaining a sufficient number of the encoded data segments, at block 1435, the encode/decode module 418 of the front-end subsystem 310 decodes the encoded data segments, e.g., based on the erasure coding method used to encode the data object, to generate the data object.

At block 1440, the transceiver module 432 transmits the data object in response to the read request, e.g., to the client system 312 a, and the process 1400 returns. In some embodiments, additional processes may be performed before decoding the data segments. For example, the storage processing module 430 can decrypt the encoded data segments if they were encrypted before being stored. In some embodiments, additional processes may be performed on the decoded data object before it is returned to the client system 312 a. For example, the storage processing module 430 can perform decompression and de-deduplication on the decoded data object if the data object was deduplicated and compressed.

FIG. 15 is a flow diagram of a process 1500 of rebuilding data fragments of a data object in hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1500 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. In some embodiments, the data fragments stored in the storage subsystem 306 may be lost due to a failure of a storage device. The process 1500 begins at block 1505, and at block 1510, a hierarchical storage node detects a failure of a storage device, e.g., storage device 304, associated with the hierarchical storage node. In some embodiments, the failure can be one or more of the storage device being not accessible, the storage device being physically damaged, the storage device determined to fail in a specified period, the storage device determined to fail in a specified number of read/write operations, etc.

At block 1515, the hierarchical storage node identifies the encoded data fragments that were stored at the storage device. For example, the hierarchical storage node can refer to the storage layout to determine the fragments stored at the storage device that has failed.

At block 1520, the hierarchical storage node identifies the one or more data segments corresponding to the identified encoded data fragments. For example, the hierarchical storage node can refer to the mapping structure to determine the data segments associated with the identified encoded data fragments.
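
The lookups of blocks 1515 and 1520 can be sketched as two dictionary queries. The dictionary shapes below (fragment identifier to device, fragment identifier to segment identifier) are assumptions made for illustration; the specification does not define the internal format of the storage layout or the mapping structure.

```python
def fragments_on_device(layout, failed_device):
    """Block 1515: fragments that the storage layout places on the failed device."""
    return [frag for frag, device in layout.items() if device == failed_device]


def segments_for_fragments(fragment_to_segment, fragments):
    """Block 1520: data segments to which the affected fragments belong."""
    return sorted({fragment_to_segment[frag] for frag in fragments})


# Hypothetical layout and mapping for two segments spread over three devices.
layout = {"f1": "d1", "f2": "d2", "f3": "d1", "f4": "d3"}
fragment_to_segment = {"f1": "s1", "f2": "s1", "f3": "s2", "f4": "s2"}

lost = fragments_on_device(layout, "d1")
affected = segments_for_fragments(fragment_to_segment, lost)
```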

At block 1525, the hierarchical storage node rebuilds some or all of the encoded data fragments that were stored at the storage device that failed. In some embodiments, rebuilding the data fragments includes performing the method described in association with blocks 1526-1530 for each of the identified data segments.

At block 1526, the hierarchical storage node identifies the storage devices where the data fragments of the identified data segment are stored. The hierarchical storage node may use the storage layout determined by the storage layout module of the node to identify the storage devices that store the data fragments of the data segment. At block 1527, the hierarchical storage node computes the current storage resiliency of the data segment. In some embodiments, storage resiliency is defined as a resistance to loss of one or more storage devices storing a portion of a data segment or resistance to loss of one or more fragments of the data segment. In some embodiments, a current storage resiliency of a data segment is determined as a function of the number of fragments remaining out of n fragments and k. For example, if n is “10” and k is “8,” the number of redundant fragments, m, is “2,” and therefore the storage resiliency can be calculated as 25% (m/k*100). Note that the storage resiliency can be calculated using other functions and based on several other parameters. The storage system 1100 may guarantee a storage resiliency range to the clients of the storage system, for example, a minimum storage resiliency and a maximum storage resiliency. In some embodiments, the storage resiliency range is part of the SLO guaranteed to the clients. In some embodiments, the storage system 1100 may not rebuild the lost data fragments until the current storage resiliency of the data segment is at or below the minimum storage resiliency.
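
The example calculation above can be written out as a short sketch. It uses the m/k*100 function from the example, with m taken as the number of remaining fragments minus k; as noted, other functions may be used. The function names and the 15% minimum are hypothetical, and the rebuild test follows the "less than" comparison of determination block 1528.

```python
def storage_resiliency(remaining: int, k: int) -> float:
    """Resiliency in percent: redundant fragments still present, relative to k."""
    if remaining < k:
        raise ValueError("fewer than k fragments remain; the segment cannot be regenerated")
    return (remaining - k) / k * 100


def needs_rebuild(remaining: int, k: int, minimum_pct: float) -> bool:
    """True when the current resiliency is less than the guaranteed minimum."""
    return storage_resiliency(remaining, k) < minimum_pct


full = storage_resiliency(10, 8)           # all n = 10 fragments present
after_one_loss = storage_resiliency(9, 8)  # one storage device has failed
```

With n = 10 and k = 8, the intact segment has a resiliency of 25%, and the loss of one fragment lowers it to 12.5%, which would trigger a rebuild under an assumed minimum of 15%.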

At determination block 1528, the hierarchical storage node determines if the current storage resiliency of the data segment is less than the minimum storage resiliency. Responsive to a determination that the current storage resiliency of the data segment is not less than the minimum storage resiliency, the process 1500 returns. On the other hand, responsive to a determination that the current storage resiliency is less than the minimum storage resiliency, at block 1529, the hierarchical storage node obtains a sufficient number of fragments of the data segment stored at the identified storage devices (e.g., identified in block 1526). In some embodiments, the hierarchical storage node can obtain the minimum number of fragments required to rebuild the data fragments.

At block 1529, the hierarchical storage node generates the replacement data fragments as a function of the obtained data fragments, and at block 1530, the hierarchical storage node stores the regenerated data fragments in at least a subset of the remaining storage devices. In some embodiments, the hierarchical storage node regenerates as many data fragments as required to meet a specified storage resiliency, which can be up to the maximum storage resiliency. In some embodiments, regenerating the data fragments as a function of the obtained data fragments includes decoding the obtained data fragments to generate the data segment and encoding the generated data segment to generate the specified number of data fragments. In some embodiments, the hierarchical spreading storage performs the encoding and decoding using an erasure coding method.
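
As a rough sketch of how many replacement fragments block 1529 would need to produce, the following assumes the m/k*100 resiliency function used in the earlier example and a target resiliency expressed as a percentage. The function name is hypothetical, and rounding the required redundancy up to a whole fragment is an assumption.

```python
import math


def fragments_to_regenerate(remaining: int, k: int, target_pct: float) -> int:
    """Number of fragments to regenerate so that (m / k) * 100 >= target_pct."""
    if remaining < k:
        raise ValueError("fewer than k fragments remain; the segment cannot be decoded")
    redundant_needed = math.ceil(k * target_pct / 100)
    return max(0, k + redundant_needed - remaining)


# k = 8, two devices lost out of n = 10, restoring the original 25% resiliency.
to_rebuild = fragments_to_regenerate(remaining=8, k=8, target_pct=25)
```

The regenerated fragments would then be placed on remaining storage devices that do not already hold a fragment of the same segment, so that the spreading property is preserved.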

FIG. 16 is a flow diagram of a process 1600 of rebuilding data segments of a data object in hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1600 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. In some embodiments, the data segments stored by a hierarchical storage node may be lost due to a failure of a storage device and/or a hierarchical storage node. The process 1600 begins at block 1605, and at block 1610, a failure detection module 424 of the front-end subsystem 310 detects a failure of a hierarchical storage node and/or a failure of one or more storage devices of the hierarchical storage node that caused the storage resiliency of a particular data segment to drop. In some embodiments, the failure can be one or more of the storage device being not accessible, the storage device being physically damaged, the hierarchical storage node not being accessible, the storage device determined to fail in a specified period, the storage device determined to fail in a specified number of read/write operations, etc.

At block 1615, the fragment/segment identification module 422 identifies the encoded data segment stored by the hierarchical storage node. For example, the fragment/segment identification module 422 can refer to the storage layout to determine the segments stored at the particular hierarchical storage node that has failed.

At block 1620, the fragment/segment identification module 422 identifies the data object to which the encoded data segment corresponds. For example, the fragment/segment identification module 422 can refer to the mapping structure to determine the data object associated with the identified data segment.

At determination block 1625, the regeneration module 428 computes the current storage resiliency of the data object and determines if the storage resiliency of the object is below the specified minimum storage resiliency. In some embodiments, a current storage resiliency of a data object is determined as a function of the number of segments remaining out of n′ segments and k′. For example, if n′ is “10” and k′ is “8,” the number of redundant segments, m′, is “2,” and therefore the storage resiliency can be calculated as 25% (m′/k′*100). Note that the storage resiliency can be calculated using other functions and based on several other parameters. In some embodiments, the storage system 1100 may not rebuild the lost data segments until the current storage resiliency of the data object is at or below the minimum storage resiliency.

Responsive to a determination that the current storage resiliency of the data object is not less than the minimum storage resiliency, the process 1600 returns. On the other hand, responsive to a determination that the current storage resiliency is less than the minimum storage resiliency, at block 1630, the transceiver module 432 obtains a sufficient number of segments of the data object stored at other hierarchical storage nodes. In some embodiments, the transceiver module 432 obtains the segments of the data object stored at other hierarchical storage nodes as described at least with reference to blocks 1425-1433 of FIG. 14.

At block 1635, the regeneration module 428 generates the replacement data segment as a function of the obtained data segments. In some embodiments, the regeneration module 428 generates as many data segments as required to meet a specified storage resiliency for the data object, which can be up to a specified maximum storage resiliency of the data object. In some embodiments, regenerating the data segments as a function of the obtained data segments includes decoding the obtained data segments to generate the data object and encoding the generated data object to generate the specified number of data segments. In some embodiments, the hierarchical spreading storage performs the encoding and decoding using an erasure coding method.

At block 1640, the transceiver module 432 sends the regenerated data segments to one or more of the remaining hierarchical storage nodes for storage at their associated storage devices. In some embodiments, the transceiver module 432 transmits the replacement data segments of the data object to other hierarchical storage nodes as described at least with reference to blocks 1320-1333 of FIG. 13.

FIG. 17 is a flow diagram of a process 1700 of deferred rebuilding of data segments of a data object in the hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1700 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. The rebuilding/regeneration process 1600 can consume significant system resources for regenerating the encoded data segments, e.g., network resources for reading at least k′ number of encoded data segments from other hierarchical storage nodes, computing resources of the corresponding hierarchical storage nodes in obtaining the fragments of the corresponding data segment and decoding them to generate the encoded data segment, etc. In some embodiments, the consumption of the system resources can be minimized by postponing or deferring the regeneration process 1600 until a later time, e.g., when the storage devices are replaced with new storage devices, when the data in the storage devices is migrated, etc.

In some embodiments, the generation of replacement data segments for the lost data segments is deferred until after one or more of the failed storage devices and/or one or more of the hierarchical storage nodes are replaced. That is, the regeneration process may not be executed during the lifetime of the storage devices and/or the hierarchical storage nodes. In some embodiments, the timing of the regeneration process is controlled based on m′, the number of redundant encoded data segments to be generated. As described above at least with reference to the regeneration process 1600, the regeneration process 1600 is triggered when the current storage resiliency of the data object drops below the minimum storage resiliency. The storage resiliency of a data object is a function of the total number of encoded data segments, n′, stored across the hierarchical storage nodes, which is a function of m′. The m′ can be determined such that the storage resiliency of the data object does not drop below the minimum storage resiliency during the lifespan of one or more of the storage devices. In other words, the number of encoded data segments generated is such that a loss of a subset of the encoded data segments does not drop the storage resiliency of the data object below the minimum storage resiliency during the lifespan of one or more of the storage devices. The following paragraphs describe the process 1700 in further detail.

The process 1700 begins at block 1705, and at block 1710, the regeneration module 428 obtains the historical information regarding a failure rate of storage devices of the type of the storage devices in the environment 300. The historical information can include a number of parameters that can describe and/or help determine the failure information of a storage device, e.g., an annual failure rate (AFR) of the storage device of a particular type, an AFR of the storage device based on a particular workload on the storage device, or how long a storage device is expected to survive based on a particular workload. Such historical information can be gathered from various sources, gathered from the environment 300 over a period, and/or input by a user such as an administrator of the environment 300.

At block 1715, the regeneration module 428 predicts the failure rate of the storage devices in the environment 300 and generates the predicted information. The regeneration module 428 can interpolate the historical information with various parameters of the storage devices in the environment 300, e.g., the number of storage devices in the environment 300, a workload of the storage devices, the number of read/write operations performed on the storage devices, a remaining life of the storage devices, and determine the predicted failure rate of the storage devices.

At block 1720, the regeneration module 428 determines the lifespan of the storage devices as a function of the historical information and the predicted information. At block 1725, the regeneration module 428 determines a statistical probability of a loss or a failure of one or more hierarchical storage nodes based on the determined lifespan of the storage devices. In some embodiments, a failure/loss of a hierarchical storage node is a function of the lifespan of the set of storage devices associated with the hierarchical storage node since a failure of one or more storage devices from the set can result in a failure of the hierarchical storage node. Further, a failure of the hierarchical storage node can result in a loss of the encoded data segment stored at the hierarchical storage node.

At block 1730, the regeneration module 428 determines the redundant number of encoded data segments, m′, to be generated for the data object based on the statistical probability of the loss of the hierarchical storage node. The regeneration module 428 notifies the encode/decode module 418 regarding the determined m′, and the encode/decode module 418 encodes the data object to generate the encoded data segments accordingly.
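
One way to picture the determination of m′ in blocks 1725 and 1730 is the following sketch. It rests on several assumptions that are not stated in the specification: node losses are treated as independent events with the same probability, the per-node loss probability over the lifespan is derived from a constant annual failure rate, and m′ is the smallest value for which the probability of falling below a required number of redundant segments stays under a chosen target. All names and figures are hypothetical.

```python
import math


def lifetime_loss_probability(annual_failure_rate: float, years: float) -> float:
    """Assumed model: independent, constant failure rate in each year."""
    return 1 - (1 - annual_failure_rate) ** years


def prob_more_than(n: int, p: float, limit: int) -> float:
    """P(X > limit) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(limit + 1, n + 1)
    )


def choose_m_prime(k_prime, p_loss, min_redundant, target, max_m=64):
    """Smallest m' such that losing more than (m' - min_redundant) of the
    n' = k' + m' segments during the lifespan has probability <= target."""
    for m in range(min_redundant, max_m + 1):
        n = k_prime + m
        if prob_more_than(n, p_loss, m - min_redundant) <= target:
            return m
    raise ValueError("no m' up to max_m satisfies the target")


p_node = lifetime_loss_probability(annual_failure_rate=0.02, years=5)
m_prime = choose_m_prime(k_prime=8, p_loss=p_node, min_redundant=2, target=1e-6)
```

Under this model, a larger m′ is selected when the predicted loss probability is higher, which is the mechanism by which regeneration can be deferred for the lifespan of the storage devices.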

In some embodiments, the regeneration module 428 may continuously adjust m′, e.g., based on a specified schedule or certain events such as when storage devices are added or removed, to factor in any change in the parameters of the environment 300, e.g., change in workload on the storage devices, addition or removal of storage devices, etc.

Note that although the process 1700 is described as being performed by the regeneration module 428, the process 1700 can be performed by a combination of modules of the front-end subsystem 310 and/or sub-modules of the regeneration module 428 (not illustrated).

FIG. 18 is a flow diagram of a process 1800 of processing metadata and data fragments of a data object in hierarchical spreading storage architecture, consistent with various embodiments of the disclosed technology. In some embodiments, the process 1800 may be implemented in environment 300 of FIG. 3, and using the storage system 1100 of FIG. 11. In some embodiments, the process 1800 is an implementation of the method of block 925 of FIG. 9. The data piece generated in the process 900 of FIG. 9, e.g., in block 920, can be considered as a data segment in the hierarchical spreading storage architecture. The process 1800 begins at block 1805, and at block 1810, the metadata processing module 426 combines the metadata of a data object, e.g., metadata 510, with each of the segments, e.g., segments 1205, to generate composite segments, e.g., composite segments 1210. In some embodiments, combining the metadata with a data segment can include concatenating the metadata with the segment or prefixing the segment with the metadata. In some embodiments, the metadata 510 combined with a segment can be a subset of the metadata of the data object 405.

After the composite segments are generated, at block 1815, the transceiver module 432 transmits the composite segments to a number of hierarchical storage nodes, e.g., as described at least with reference to blocks 1320 and 1325 of FIG. 13, for further storage at a set of storage devices associated with the hierarchical storage nodes.

At block 1820, when a particular hierarchical storage node receives a composite data segment, it encodes the composite data segment to generate a number of data fragments, e.g., fragments 1215 (block 1821). In some embodiments, the composite data segment is encoded to generate a number of data fragments as described at least with reference to block 1331 of FIG. 13.

At block 1822, the particular hierarchical storage node combines each of the fragments with the metadata, e.g., concatenates or prefixes the fragments 1215 with the metadata 510, to generate the composite fragments, e.g., composite fragments 1220.

After the composite fragments are generated, at block 1823, the particular hierarchical storage node stores the composite fragments at a set of storage devices associated with the hierarchical storage node, e.g., as described with reference to blocks 1332 and 1333 of FIG. 13.

Note that although FIG. 18 illustrates combining metadata 510 with both the data segments and the fragments, the metadata 510 can be combined with either the data segments or the data fragments.
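
The prefixing of metadata onto a segment or fragment can be sketched as follows. The on-disk format shown (a 4-byte big-endian length header followed by JSON-encoded metadata and then the payload) is purely an assumption for illustration; the specification states only that the metadata is concatenated with or prefixed to the segment or fragment. The function names and metadata keys are hypothetical.

```python
import json
import struct


def make_composite(metadata: dict, payload: bytes) -> bytes:
    """Prefix the payload with length-delimited metadata."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def split_composite(composite: bytes):
    """Recover (metadata, payload) from a composite segment or fragment."""
    (header_len,) = struct.unpack(">I", composite[:4])
    header = composite[4:4 + header_len]
    return json.loads(header.decode("utf-8")), composite[4 + header_len:]


metadata = {"object_id": "obj-42", "owner": "client-a"}
composite_segment = make_composite(metadata, b"segment bytes")
recovered_metadata, recovered_payload = split_composite(composite_segment)
```

Carrying a copy of the metadata in every composite piece means the metadata can be recovered from any surviving segment or fragment, at the cost of storing it redundantly.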

FIG. 19 is a block diagram of a computer system as may be used to implement features of some embodiments of the disclosed technology. The computing system 1900 may be used to implement any of the entities, components or services depicted in the examples of FIGS. 1-18 (and any other components described in this specification). The computing system 1900 may include one or more central processing units (“processors”) 1905, memory 1910, input/output devices 1925 (e.g., keyboard and pointing devices, display devices), storage devices 1920 (e.g., disk drives), and network adapters 1930 (e.g., network interfaces) that are connected to an interconnect 1915. The interconnect 1915 is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect 1915, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also called “Firewire”.

The memory 1910 and storage devices 1920 are computer-readable storage media that may store instructions that implement at least portions of the described technology. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer readable media can include computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.

The instructions stored in memory 1910 can be implemented as software and/or firmware to program the processor(s) 1905 to carry out actions described above. In some embodiments, such software or firmware may be initially provided to the computing system 1900 by downloading it from a remote system through the computing system 1900 (e.g., via network adapter 1930).

The technology introduced herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired (non-programmable) circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

Remarks

The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in some instances, well-known details are not described in order to avoid obscuring the description. Further, various modifications may be made without deviating from the scope of the embodiments. Accordingly, the embodiments are not limited except as by the appended claims.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Some terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. For convenience, some terms may be highlighted, for example using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way. One will recognize that “memory” is one form of a “storage” and that the terms may on occasion be used interchangeably.

Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for some terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification including examples of any term discussed herein is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.

Those skilled in the art will appreciate that the logic illustrated in each of the flow diagrams discussed above may be altered in various ways. For example, the order of the logic may be rearranged, substeps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc.

Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.

I/we claim:
 1. A computer-implemented method comprising: receiving, at a storage management computer node of a storage management system, a write request including a data object; encoding, by the storage management computer node, the data object to generate a first specified number of multiple encoded data fragments, the encoded data fragments representing the data object, the first specified number of encoded data fragments including a second specified number of the encoded data fragments using which the data object can be regenerated; determining, by the storage management computer node, a storage layout of the encoded data fragments for storing the encoded data fragments at multiple storage devices, the storage devices grouped into multiple storage shelves, wherein a number of the storage devices is equal to or greater than the first specified number of the encoded data fragments; and transmitting, by the storage management computer node and based on the storage layout, to at least a subset of the storage shelves to store the encoded data fragments at the storage devices.
 2. The computer-implemented method of claim 1, wherein encoding the data object to generate the first specified number of the encoded data fragments includes generating the encoded data fragments based on a specified ratio of the first specified number to the second specified number.
 3. The computer-implemented method of claim 2, wherein the specified ratio is a function of a specified storage resiliency, the specified storage resiliency indicating resistance to at least one of a failure of a specified number of the storage devices or a loss of a specified number of the encoded data fragments without losing the data object.
 4. The computer-implemented method of claim 1, wherein encoding the data object includes: associating an object identifier with the data object, associating fragment identifiers with the encoded data fragments of the data object, and generating a mapping of the fragment identifiers to the object identifier.
 5. The computer-implemented method of claim 4, wherein the object identifier and the fragment identifiers are stored in different namespaces of the storage management computer node.
 6. The computer-implemented method of claim 1, wherein the encoding includes selecting a fragmentation technique to fragment the data object.
 7. The computer-implemented method of claim 6, wherein selecting the fragmentation technique includes selecting the fragmentation technique based on at least one of a deduplication binning requirement or an erasure coding requirement.
 8. The computer-implemented method of claim 1, wherein determining the storage layout includes determining the storage layout based on an attribute of the write request.
 9. The computer-implemented method of claim 8, wherein the attribute of the write request includes a service level objective (SLO) of the write request.
 10. The computer-implemented method of claim 8, wherein the attribute of the write request includes a specified storage resiliency, the specified storage resiliency indicating tolerance to failure of a specified number of the storage devices.
 11. The computer-implemented method of claim 1, wherein the storage layout includes a first identification information of the storage shelves at which each of the encoded data fragments is stored.
 12. The computer-implemented method of claim 11, wherein the storage layout includes a second identification information of a storage device within a storage shelf of the storage shelves at which each of the encoded data fragments is stored.
 13. The computer-implemented method of claim 1, wherein determining the storage layout includes determining the storage devices at which the encoded data fragments are to be stored on a random basis.
 14. The computer-implemented method of claim 1, wherein determining the storage layout includes determining the storage devices at which the encoded data fragments are to be stored on a random weighted basis.
 15. The computer-implemented method of claim 14, wherein the random weighted basis randomly identifies the storage devices at which the encoded data fragments are to be stored as a function of an available storage capacity at the storage devices.
 16. The computer-implemented method of claim 14, wherein determining the storage layout on the random weighted basis includes: determining that a first storage device of the storage devices has higher available storage capacity than a second storage device of the storage devices, and storing data at the first storage device at a higher rate than at the second storage device.
 17. The computer-implemented method of claim 14, wherein the random weighted basis distributes the encoded data fragments across the storage devices evenly.
 18. The computer-implemented method of claim 14, wherein the random weighted basis randomly identifies the storage devices at which the encoded data fragments are to be stored as a function of decreasing the risk of data loss.
 19. The computer-implemented method of claim 1, wherein a number of the storage shelves is at least the first number of the encoded data fragments divided by a largest number of storage devices per storage shelf of the storage shelves.
 20. A computer-readable storage medium storing computer-executable instructions comprising: instructions for receiving, at a storage management computer node of a storage management system, a read request for obtaining a data object stored at a storage subsystem, the read request including an object identifier of the data object; instructions for determining, by the storage management computer node and using the object identifier, multiple encoded data fragments of the data object, wherein the data object is stored at the storage subsystem as “N” number of the encoded data fragments, the “N” number of encoded data fragments including “K” number of the encoded data fragments using which the data object can be regenerated; instructions for determining, by the storage management computer node, a storage layout of the encoded data fragments, the storage layout including identification information of (a) one or more of multiple storage shelves of the storage subsystem that store the encoded data fragments and (b) multiple storage devices of the storage shelves that store the encoded data fragments, wherein a number of the storage devices is equal to or greater than “N”; and instructions for obtaining, by the storage management computer node and based on the storage layout, the encoded data fragments from the storage devices.
 21. The computer-readable storage medium of claim 20 further comprising: instructions for decoding the encoded data fragments obtained from the storage devices to regenerate the data object; and instructions for transmitting the data object from the storage management computer node in response to the read request.
 22. The computer-readable storage medium of claim 20, wherein the “K” number of the encoded data fragments is a minimum number of encoded data fragments required to regenerate the data object.
 23. The computer-readable storage medium of claim 20, wherein the instructions for obtaining the encoded data fragments include instructions for obtaining at least the “K” number of the encoded data fragments.
 24. The computer-readable storage medium of claim 23, wherein the instructions for obtaining the “K” number of the encoded data fragments include: instructions for selecting a first “K” number of the encoded data fragments that arrive at the storage management computer node from the storage devices.
 25. The computer-readable storage medium of claim 23, wherein the instructions for obtaining the “K” number of the encoded data fragments include: instructions for selecting a subset of the storage devices as a function of at least one of multiple attributes of a specified storage device of the storage devices, and instructions for obtaining the “K” number of the encoded data fragments from the subset of the storage devices.
 26. The computer-readable storage medium of claim 25, wherein the attributes of the specified storage device include a read latency of the specified storage device, a number of pending read requests at the specified storage device, or a number of pending write requests at the specified storage device.
 27. The computer-readable storage medium of claim 20, wherein the encoded data fragments are generated from the data object based on a fragmentation technique to fragment the data object.
 28. The computer-readable storage medium of claim 27, wherein the fragmentation technique is based on an erasure coding technique.
 29. The computer-readable storage medium of claim 20, wherein a number of the storage shelves is at least the “N” number of the encoded data fragments divided by a largest number of storage devices per storage shelf of the storage shelves.
 30. A system comprising: a processor; a first module configured to receive a write request including a data object; a second module configured to encode the data object to generate a first specified number of multiple encoded data fragments out of which a second specified number of the encoded data fragments are used to regenerate the data object, the encoded data fragments representing the data object; a third module configured to determine a storage layout for storing the encoded data fragments at multiple storage devices, the storage devices grouped into multiple storage shelves, wherein a number of the storage devices is equal to or greater than the first specified number of the encoded data fragments; and a fourth module configured to transmit the encoded data fragments to at least a subset of the storage shelves based on the storage layout to store the encoded data fragments at one or more of a set of the storage devices of each of the subset of the storage shelves.
 31. The system of claim 30, wherein the second module is further configured to encode the data object as a function of an erasure coding technique.
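
The following is an illustrative sketch only and forms no part of the claims. It shows, in Python, one way the layout and read-selection steps recited above might be realized: the shelf-count lower bound of claims 19 and 29, a random weighted choice of storage devices as a function of available capacity (claims 14 to 16), and selection of the first “K” fragments to arrive (claim 24). The function names, the (shelf, device) tuple keys, and the rule that each fragment goes to a distinct device are assumptions made for the example; the claims do not prescribe them.

```python
import math
import random


def min_shelves(n_fragments, max_devices_per_shelf):
    """Lower bound on the shelf count: N divided by the largest number of
    devices per shelf, rounded up because shelves are whole units."""
    return math.ceil(n_fragments / max_devices_per_shelf)


def choose_devices(free_capacity, n_fragments, rng=None):
    """Pick n_fragments distinct devices at random, weighted by free capacity.

    free_capacity maps a hypothetical (shelf_id, device_id) key to the free
    space on that device. Assumption: every capacity is positive, and each
    fragment is placed on its own device.
    """
    rng = rng or random.Random()
    if n_fragments > len(free_capacity):
        raise ValueError("need at least as many devices as fragments")
    remaining = dict(free_capacity)
    layout = []
    for _ in range(n_fragments):
        devices = list(remaining)
        weights = [remaining[d] for d in devices]
        # Devices with more free space are proportionally more likely.
        picked = rng.choices(devices, weights=weights, k=1)[0]
        layout.append(picked)
        del remaining[picked]  # keep the chosen devices distinct
    return layout


def first_k(arrivals, k):
    """Return the first k distinct fragments from an iterable of
    (fragment_id, payload) pairs given in arrival order."""
    received = {}
    for fragment_id, payload in arrivals:
        received.setdefault(fragment_id, payload)
        if len(received) == k:
            return received
    raise ValueError("fewer than k distinct fragments arrived")
```

For example, with N = 12 fragments and at most 5 devices per shelf, min_shelves(12, 5) gives 3 shelves. Removing each chosen device from the candidate pool is one simple way to keep the number of devices used equal to N; a real system could equally weight by other attributes, such as those listed in claim 26.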